Unsupervised Learning: Trade&Ahead¶

Context¶

The stock market has consistently proven to be a good place to invest in and save for the future. There are a lot of compelling reasons to invest in stocks. It can help in fighting inflation, create wealth, and also provides some tax benefits. Good steady returns on investments over a long period of time can also grow a lot more than seems possible. Also, thanks to the power of compound interest, the earlier one starts investing, the larger the corpus one can have for retirement. Overall, investing in stocks can help meet life's financial aspirations.

It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. Having a diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is often easy to get lost in a sea of financial metrics to analyze while determining the worth of a stock, and doing the same for a multitude of stocks to identify the right picks for an individual can be a tedious task. By doing a cluster analysis, one can identify stocks that exhibit similar characteristics and ones which exhibit minimum correlation. This will help investors better analyze stocks across different market segments and help protect against risks that could make the portfolio vulnerable to losses.

Objective¶

Trade&Ahead is a financial consultancy firm who provide their customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock price and some financial indicators for a few companies listed under the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.

Data Dictionary¶

  • Ticker Symbol: An abbreviation used to uniquely identify publicly traded shares of a particular stock on a particular stock market
  • Company: Name of the company
  • GICS Sector: The specific economic sector assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • GICS Sub Industry: The specific sub-industry group assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • Current Price: Current stock price in dollars
  • Price Change: Percentage change in the stock price in 13 weeks
  • Volatility: Standard deviation of the stock price over the past 13 weeks
  • ROE: A measure of financial performance calculated by dividing net income by shareholders' equity (shareholders' equity is equal to a company's assets minus its debt)
  • Cash Ratio: The ratio of a company's total reserves of cash and cash equivalents to its total current liabilities
  • Net Cash Flow: The difference between a company's cash inflows and outflows (in dollars)
  • Net Income: Revenues minus expenses, interest, and taxes (in dollars)
  • Earnings Per Share: Company's net profit divided by the number of common shares it has outstanding (in dollars)
  • Estimated Shares Outstanding: Company's stock currently held by all its shareholders
  • P/E Ratio: Ratio of the company's current stock price to the earnings per share
  • P/B Ratio: Ratio of the company's stock price per share by its book value per share (book value of a company is the net difference between that company's total assets and total liabilities)

Importing necessary libraries and data¶

In [1]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='darkgrid')

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# To supress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# to scale the data using z-score
from sklearn.preprocessing import StandardScaler

# to compute distances
from scipy.spatial.distance import cdist, pdist

# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
In [2]:
df = pd.read_csv('/Users/ritima/Desktop/Project Unsupervised Learning/stock_data.csv')
In [3]:
# copying data to another varaible to avoid any changes to original data
data = df.copy()

Data Overview¶

  • Observations
  • Sanity checks

Checking the shape of the dataset¶

In [4]:
# Checking the number of rows and columns in the train data
data.shape
Out[4]:
(340, 15)

Displaying few rows of the dataset¶

In [5]:
# Viewing a random sample of the dataset
data.sample(n=10, random_state=1)
Out[5]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
102 DVN Devon Energy Corp. Energy Oil & Gas Exploration & Production 32.000 -15.478 2.924 205 70 830000000 -14454000000 -35.550 406582278.500 93.089 1.786
125 FB Facebook Information Technology Internet Software & Services 104.660 16.224 1.321 8 958 592000000 3669000000 1.310 2800763359.000 79.893 5.884
11 AIV Apartment Investment & Mgmt Real Estate REITs 40.030 7.579 1.163 15 47 21818000 248710000 1.520 163625000.000 26.336 -1.269
248 PG Procter & Gamble Consumer Staples Personal Products 79.410 10.661 0.806 17 129 160383000 636056000 3.280 491391569.000 24.070 -2.257
238 OXY Occidental Petroleum Energy Oil & Gas Exploration & Production 67.610 0.865 1.590 32 64 -588000000 -7829000000 -10.230 765298142.700 93.089 3.345
336 YUM Yum! Brands Inc Consumer Discretionary Restaurants 52.516 -8.699 1.479 142 27 159000000 1293000000 2.970 435353535.400 17.682 -3.838
112 EQT EQT Corporation Energy Oil & Gas Exploration & Production 52.130 -21.254 2.365 2 201 523803000 85171000 0.560 152091071.400 93.089 9.568
147 HAL Halliburton Co. Energy Oil & Gas Equipment & Services 34.040 -5.102 1.966 4 189 7786000000 -671000000 -0.790 849367088.600 93.089 17.346
89 DFS Discover Financial Services Financials Consumer Finance 53.620 3.654 1.160 20 99 2288000000 2297000000 5.140 446887159.500 10.432 -0.376
173 IVZ Invesco Ltd. Financials Asset Management & Custody Banks 33.480 7.067 1.581 12 67 412000000 968100000 2.260 428362831.900 14.814 4.219

Checking the data types of the columns for the dataset¶

In [6]:
# checking the column names and datatypes
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Ticker Symbol                 340 non-null    object 
 1   Security                      340 non-null    object 
 2   GICS Sector                   340 non-null    object 
 3   GICS Sub Industry             340 non-null    object 
 4   Current Price                 340 non-null    float64
 5   Price Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64  
 8   Cash Ratio                    340 non-null    int64  
 9   Net Cash Flow                 340 non-null    int64  
 10  Net Income                    340 non-null    int64  
 11  Earnings Per Share            340 non-null    float64
 12  Estimated Shares Outstanding  340 non-null    float64
 13  P/E Ratio                     340 non-null    float64
 14  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB

Observations¶

  • There are 340 rows and 15 columns.
  • The data types consist of all continuous float values, with 4 object type variables.
  • There are no missing values in the dataset.

Statistical summary of the dataset¶

In [7]:
# Viewing the statistical summary of the numerical columns in the data
data.describe().T
Out[7]:
count mean std min 25% 50% 75% max
Current Price 340.000 80.862 98.055 4.500 38.555 59.705 92.880 1274.950
Price Change 340.000 4.078 12.006 -47.130 -0.939 4.820 10.695 55.052
Volatility 340.000 1.526 0.592 0.733 1.135 1.386 1.696 4.580
ROE 340.000 39.597 96.548 1.000 9.750 15.000 27.000 917.000
Cash Ratio 340.000 70.024 90.421 0.000 18.000 47.000 99.000 958.000
Net Cash Flow 340.000 55537620.588 1946365312.176 -11208000000.000 -193906500.000 2098000.000 169810750.000 20764000000.000
Net Income 340.000 1494384602.941 3940150279.328 -23528000000.000 352301250.000 707336000.000 1899000000.000 24442000000.000
Earnings Per Share 340.000 2.777 6.588 -61.200 1.558 2.895 4.620 50.090
Estimated Shares Outstanding 340.000 577028337.754 845849595.418 27672156.860 158848216.100 309675137.800 573117457.325 6159292035.000
P/E Ratio 340.000 32.613 44.349 2.935 15.045 20.820 31.765 528.039
P/B Ratio 340.000 -1.718 13.967 -76.119 -4.352 -1.067 3.917 129.065
In [8]:
# Viewing the statistical summary of the numerical columns in the data
data.describe(include=[object]).T
Out[8]:
count unique top freq
Ticker Symbol 340 340 AAL 1
Security 340 340 American Airlines Group 1
GICS Sector 340 11 Industrials 53
GICS Sub Industry 340 104 Oil & Gas Exploration & Production 16

Observations¶

  • There appear to be outliers in a number of variables including Current Price, ROE, Cash Ratio, P/E Ratio and P/B Ratio.
  • The average stock price is around 80 dollars, while the maximum is 1275 dollars.
  • With respect to the percentage change in the stock price in 13 weeks, the average is around 4% while the minimum value is -47% and maximum 55%, indicating presence of outliers at either end.
  • The minimum value for Cash Ratio is 0, indicating a scenario where a company's total reserves of cash and cash equivalents were equal to its total current liabilities, i.e., the company was not cash positive.
  • There are negative values in Net Cash Flow, which means a company's cash outflows exceeded its cash inflows.
  • Again, Net Income has a negative minimum value, suggesting possible losses for a company/companies. Likewise, Earnings Per Share, which is the company's net profit/loss divided by the number of common shares it has outstanding has a minimum value of -61 dollars.
  • The average P/E Ratio is 32, while the maximum is 528, indicating outliers at the right end.
  • The average P/B Ratio, which is the ratio of a company's stock price per share by its book value per share, is around -2 and median is -1, indicating suboptimal performance of company stocks.
  • Ticker Symbol and Security are unique i.d. columns and hence should be dropped from further analysis.
  • The variable GICS Sector has 11 unique values, where the maximum frequency is 53 for the sector - Industrials. this means that out of 340 companies, 53 belong to the GICS Sector - Industrials.
  • GICS Sub Industry variable has 104 unique values, with Oil & Gas Exploration & Production having the maximum frequency of 16 (companies belonging to this sub-industry group).

Exploratory Data Analysis (EDA)¶

Questions:

  1. What does the distribution of stock prices look like?
  2. The stocks of which economic sector have seen the maximum price increase on average?
  3. How are the different variables correlated with each other?
  4. Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
  5. P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?

Univariate analysis¶

In [9]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to the show density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="pink"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

Distribution of Stock Prices¶

In [10]:
# Creating histogram_boxplot for 'Current Price'
histogram_boxplot(data, 'Current Price')
In [11]:
# Checking the observations which have current price more than 600 dollars
data.loc[data["Current Price"] > 600]
Out[11]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
26 AMZN Amazon.com Inc Consumer Discretionary Internet & Direct Marketing Retail 675.890 32.268 1.460 4 58 1333000000 596000000 1.280 465625000.000 528.039 3.904
243 PCLN Priceline.com Inc Consumer Discretionary Internet & Direct Marketing Retail 1274.950 3.191 1.268 29 184 -1671386000 2551360000 50.090 50935516.070 25.453 -1.052
  • The distribution of Stock Prices is highly skewed to the right, which indicates the presence of very high stock prices.
  • As can be seen above, only one company contributes to the outlier of stock price above 1200 dollars, where the percentage change in the stock price over the last 13 weeks has been only 3.2%. However, the net cash flow (i.e., cash outflows > cash inflows) and P/B ratio are negative for this company.
  • One company has a stock price above 600 dollars and a percentage change in the stock price over the last 13 weeks of 32%.
In [12]:
# Creating histogram_boxplot for 'Price Change'
histogram_boxplot(data, 'Price Change')
In [13]:
# Creating histogram_boxplot for 'Volatility'
histogram_boxplot(data, 'Volatility')
In [14]:
# Creating histogram_boxplot for 'ROE'
histogram_boxplot(data, 'ROE')
In [15]:
# Creating histogram_boxplot for 'Cash Ratio'
histogram_boxplot(data, 'Cash Ratio')
In [16]:
# Creating histogram_boxplot for 'Net Cash Flow'
histogram_boxplot(data, 'Net Cash Flow')
In [17]:
# Creating histogram_boxplot for 'Net Income'
histogram_boxplot(data, 'Net Income')
In [18]:
# Creating histogram_boxplot for 'Earnings Per Share'
histogram_boxplot(data, 'Earnings Per Share')
In [19]:
# Creating histogram_boxplot for 'Estimated Shares Outstanding'
histogram_boxplot(data, 'Estimated Shares Outstanding')
In [20]:
# Creating histogram_boxplot for 'P/E Ratio'
histogram_boxplot(data, 'P/E Ratio')
In [21]:
# Creating histogram_boxplot for 'P/B Ratio'
histogram_boxplot(data, 'P/B Ratio')

Observations¶

Q1. What does the distribution of stock prices look like?¶

  • All variables appear to have outliers on the higher end, however Price Change, Net Cash Flow, Net Income, Earnings Per Share and P/B Ratio have outliers at the lower end as well.
  • The distribution of Current Price is highly skewed to the right, which indicates the presence of very high stock prices. The distributions of ROE and Cash Ratio exhibit a similar trend.
  • The distributions of Net Cash Flow, Net Income and Earnings Per Share have a lot of outliers, suggesting higher spread among the values.
  • All variables vary in their scale, for which the data should be scaled.
In [22]:
# selecting numerical columns
num_col = data.select_dtypes(include=np.number).columns.tolist()

fig, axes = plt.subplots(3, 2, figsize=(20, 15))
fig.suptitle("CDF plot of numerical variables", fontsize=20)
counter = 0
for ii in range(3):
    sns.ecdfplot(ax=axes[ii][0], x=df[num_col[counter]])
    counter = counter + 1
    if counter != 5:
        sns.ecdfplot(ax=axes[ii][1], x=df[num_col[counter]])
        counter = counter + 1
    else:
        pass

fig.tight_layout(pad=2.0)

Observations¶

  • 95% of the companies have current stock price less than 200 dollars.
  • The percentage change in stock price ranges from -40 to 40%, with 90% of stock prices changing less than 20%.
  • 90% of the stock prices have volatility less than 2.5.
  • 95% of stocks have an ROE and Cash Ratio of 200 or more.
In [23]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [24]:
# Creating a labelled barplot for 'GICS Sector'
labeled_barplot(data, 'GICS Sector', perc=True)
In [25]:
# Creating a labelled barplot for 'GICS Sub Industry'
labeled_barplot(data, 'GICS Sub Industry', perc=True)
In [26]:
# Cross-tabulating by industry sectors and sub-industry groups
ss = pd.crosstab(index=data["GICS Sub Industry"], columns=data["GICS Sector"])
ss
Out[26]:
GICS Sector Consumer Discretionary Consumer Staples Energy Financials Health Care Industrials Information Technology Materials Real Estate Telecommunications Services Utilities
GICS Sub Industry
Advertising 2 0 0 0 0 0 0 0 0 0 0
Aerospace & Defense 0 0 0 0 0 4 0 0 0 0 0
Agricultural Products 0 1 0 0 0 0 0 0 0 0 0
Air Freight & Logistics 0 0 0 0 0 3 0 0 0 0 0
Airlines 0 0 0 0 0 5 0 0 0 0 0
Alternative Carriers 0 0 0 0 0 0 0 0 0 1 0
Apparel, Accessories & Luxury Goods 1 0 0 0 0 0 0 0 0 0 0
Application Software 0 0 0 0 0 0 2 0 0 0 0
Asset Management & Custody Banks 0 0 0 4 0 0 0 0 0 0 0
Auto Parts & Equipment 2 0 0 0 0 0 0 0 0 0 0
Automobile Manufacturers 2 0 0 0 0 0 0 0 0 0 0
Banks 0 0 0 10 0 0 0 0 0 0 0
Biotechnology 0 0 0 0 7 0 0 0 0 0 0
Brewers 0 1 0 0 0 0 0 0 0 0 0
Broadcasting & Cable TV 2 0 0 0 0 0 0 0 0 0 0
Building Products 0 0 0 0 0 4 0 0 0 0 0
Cable & Satellite 3 0 0 0 0 0 0 0 0 0 0
Casinos & Gaming 1 0 0 0 0 0 0 0 0 0 0
Computer Hardware 0 0 0 0 0 0 1 0 0 0 0
Construction & Farm Machinery & Heavy Trucks 0 0 0 0 0 3 0 0 0 0 0
Construction Materials 0 0 0 0 0 0 0 2 0 0 0
Consumer Electronics 1 0 0 0 0 0 0 0 0 0 0
Consumer Finance 0 0 0 5 0 0 0 0 0 0 0
Copper 0 0 0 0 0 0 0 1 0 0 0
Data Processing & Outsourced Services 0 0 0 0 0 0 2 0 0 0 0
Distributors 1 0 0 0 0 0 0 0 0 0 0
Diversified Chemicals 0 0 0 0 0 0 0 5 0 0 0
Diversified Commercial Services 0 0 0 0 0 1 0 0 0 0 0
Diversified Financial Services 0 0 0 7 0 0 0 0 0 0 0
Drug Retail 0 1 0 0 0 0 0 0 0 0 0
Electric Utilities 0 0 0 0 0 0 0 0 0 0 12
Electrical Components & Equipment 0 0 0 0 0 1 0 0 0 0 0
Electronic Components 0 0 0 0 0 0 2 0 0 0 0
Electronic Equipment & Instruments 0 0 0 0 0 0 1 0 0 0 0
Environmental Services 0 0 0 0 0 1 0 0 0 0 0
Fertilizers & Agricultural Chemicals 0 0 0 0 0 0 0 2 0 0 0
Financial Exchanges & Data 0 0 0 1 0 0 0 0 0 0 0
Gold 0 0 0 0 0 0 0 1 0 0 0
Health Care Distributors 0 0 0 0 3 0 0 0 0 0 0
Health Care Equipment 0 0 0 0 11 0 0 0 0 0 0
Health Care Facilities 0 0 0 0 5 0 0 0 0 0 0
Health Care Supplies 0 0 0 0 2 0 0 0 0 0 0
Home Entertainment Software 0 0 0 0 0 0 1 0 0 0 0
Home Furnishings 1 0 0 0 0 0 0 0 0 0 0
Homebuilding 2 0 0 0 0 0 0 0 0 0 0
Hotels, Resorts & Cruise Lines 4 0 0 0 0 0 0 0 0 0 0
Household Appliances 1 0 0 0 0 0 0 0 0 0 0
Household Products 0 3 0 0 0 0 0 0 0 0 0
Housewares & Specialties 1 0 0 0 0 0 0 0 0 0 0
Human Resource & Employment Services 0 0 0 0 0 1 0 0 0 0 0
IT Consulting & Other Services 0 0 0 0 0 0 3 0 0 0 0
Industrial Conglomerates 0 0 0 0 0 14 0 0 0 0 0
Industrial Gases 0 0 0 0 0 0 0 1 0 0 0
Industrial Machinery 0 0 0 0 0 5 0 0 0 0 0
Industrial Materials 0 0 0 0 0 1 0 0 0 0 0
Insurance Brokers 0 0 0 3 0 0 0 0 0 0 0
Integrated Oil & Gas 0 0 5 0 0 0 0 0 0 0 0
Integrated Telecommunications Services 0 0 0 0 0 0 0 0 0 4 0
Internet & Direct Marketing Retail 4 0 0 0 0 0 0 0 0 0 0
Internet Software & Services 0 0 0 0 0 0 12 0 0 0 0
Investment Banking & Brokerage 0 0 0 2 0 0 0 0 0 0 0
Leisure Products 2 0 0 0 0 0 0 0 0 0 0
Life & Health Insurance 0 0 0 3 0 0 0 0 0 0 0
Life Sciences Tools & Services 0 0 0 0 1 0 0 0 0 0 0
Managed Health Care 0 0 0 0 5 0 0 0 0 0 0
Metal & Glass Containers 0 0 0 0 0 0 0 1 0 0 0
Motorcycle Manufacturers 1 0 0 0 0 0 0 0 0 0 0
Multi-Sector Holdings 0 0 0 1 0 0 0 0 0 0 0
Multi-line Insurance 0 0 0 1 0 0 0 0 0 0 0
MultiUtilities 0 0 0 0 0 0 0 0 0 0 11
Networking Equipment 0 0 0 0 0 0 1 0 0 0 0
Office REITs 0 0 0 0 0 0 0 0 1 0 0
Oil & Gas Equipment & Services 0 0 3 0 0 0 0 0 0 0 0
Oil & Gas Exploration & Production 0 0 16 0 0 0 0 0 0 0 0
Oil & Gas Refining & Marketing & Transportation 0 0 6 0 0 0 0 0 0 0 0
Packaged Foods & Meats 0 6 0 0 0 0 0 0 0 0 0
Paper Packaging 0 0 0 0 0 0 0 2 0 0 0
Personal Products 0 1 0 0 0 0 0 0 0 0 0
Pharmaceuticals 0 0 0 0 6 0 0 0 0 0 0
Property & Casualty Insurance 0 0 0 8 0 0 0 0 0 0 0
Publishing 1 0 0 0 0 0 0 0 0 0 0
REITs 0 0 0 0 0 0 0 0 14 0 0
Railroads 0 0 0 0 0 4 0 0 0 0 0
Real Estate Services 0 0 0 0 0 0 0 0 1 0 0
Regional Banks 0 0 0 3 0 0 0 0 0 0 0
Research & Consulting Services 0 0 0 0 0 4 0 0 0 0 0
Residential REITs 0 0 0 0 0 0 0 0 4 0 0
Restaurants 3 0 0 0 0 0 0 0 0 0 0
Retail REITs 0 0 0 0 0 0 0 0 4 0 0
Semiconductor Equipment 0 0 0 0 0 0 1 0 0 0 0
Semiconductors 0 0 0 0 0 0 6 0 0 0 0
Soft Drinks 0 4 0 0 0 0 0 0 0 0 0
Specialized REITs 0 0 0 0 0 0 0 0 3 0 0
Specialty Chemicals 0 0 0 0 0 0 0 4 0 0 0
Specialty Retail 1 0 0 0 0 0 0 0 0 0 0
Specialty Stores 3 0 0 0 0 0 0 0 0 0 0
Steel 0 0 0 0 0 0 0 1 0 0 0
Technology Hardware, Storage & Peripherals 0 0 0 0 0 0 1 0 0 0 0
Technology, Hardware, Software and Supplies 0 0 0 0 0 1 0 0 0 0 0
Thrifts & Mortgage Finance 0 0 0 1 0 0 0 0 0 0 0
Tires & Rubber 1 0 0 0 0 0 0 0 0 0 0
Tobacco 0 2 0 0 0 0 0 0 0 0 0
Trucking 0 0 0 0 0 1 0 0 0 0 0
Water Utilities 0 0 0 0 0 0 0 0 0 0 1
In [27]:
# Creating Stacked Bar Chart for industry sector and sub-industry group
ss.plot(
    kind="bar",
    colormap="nipy_spectral",
    figsize=(22, 8),
    rot=1,
    stacked=True,
    linewidth=2,
    edgecolor="navy",
    fontsize=14,) 
plt.title("Industry Sectors and Sub-Industry Groups", fontsize=10) # Setting plot title
plt.xticks(rotation=90, fontsize=8)
plt.xlabel("GICS Sub Industry", fontsize=10)
Out[27]:
Text(0.5, 0, 'GICS Sub Industry')

Observations¶

  • The majority, close to 16% (i.e. 53 companies) of the 340 companies in the dataset belonged to the Industrials sector. This was followed by 14.4% (i.e. 49 companies) in Financials sector and 11.8% (i.e. 40 companies) each in Consumer Discretionary & Health Care sectors.
  • The least number, i.e. just 1.5% (5 companies) fell under the Telecommunications Services sector.
  • With respect to the GICS Sub-Industry groups, the 340 were spread across 104 different sub-sectors.
  • Here, the maximum companies (4.7%) belonged to the Oil & Gas Exploration & Production, followed by 4.1% in the Real Estate Investment Boards (REITs) and Industrial Conglomerates groups each and 3.5% in the Internet software & Services and Electric Utilities groups each.
  • There were as many as 42 sub-industry groups with 1 company (0.3%) in each.

Bivariate analysis¶

1. Security/Company and Current Price¶

In [28]:
# Current Price and security/company

cpc = data.groupby(["Security", "GICS Sector", "GICS Sub Industry"])["Current Price"].mean().reset_index()
cpc.sort_values('Current Price', ascending=False)
Out[28]:
Security GICS Sector GICS Sub Industry Current Price
250 Priceline.com Inc Consumer Discretionary Internet & Direct Marketing Retail 1274.950
18 Amazon.com Inc Consumer Discretionary Internet & Direct Marketing Retail 675.890
173 Intuitive Surgical Inc. Health Care Health Care Equipment 546.160
261 Regeneron Health Care Biotechnology 542.870
75 Chipotle Mexican Grill Consumer Discretionary Restaurants 479.850
... ... ... ... ...
37 Arconic Inc Industrials Aerospace & Defense 7.399
278 Southwestern Energy Energy Oil & Gas Exploration & Production 7.110
141 Freeport-McMoran Cp & Gld Materials Copper 6.770
142 Frontier Communications Telecommunications Services Integrated Telecommunications Services 4.670
73 Chesapeake Energy Energy Integrated Oil & Gas 4.500

340 rows × 4 columns

In [29]:
# Looking at the top 10 entries for Current Price and security/company

cpcc = cpc.sort_values('Current Price', ascending=False).head(10)
cpcc
Out[29]:
Security GICS Sector GICS Sub Industry Current Price
250 Priceline.com Inc Consumer Discretionary Internet & Direct Marketing Retail 1274.950
18 Amazon.com Inc Consumer Discretionary Internet & Direct Marketing Retail 675.890
173 Intuitive Surgical Inc. Health Care Health Care Equipment 546.160
261 Regeneron Health Care Biotechnology 542.870
75 Chipotle Mexican Grill Consumer Discretionary Restaurants 479.850
209 Mettler Toledo Health Care Life Sciences Tools & Services 339.130
43 BIOGEN IDEC Inc. Health Care Biotechnology 306.350
119 Equinix Real Estate REITs 302.400
14 Alliance Data Systems Information Technology Data Processing & Outsourced Services 276.570
273 Sherwin-Williams Materials Specialty Chemicals 259.600
In [30]:
# Creating barplot of current price and company/security (head)

plt.figure(figsize=(12, 6))
sns.barplot(data=cpcc, x="Security", y="Current Price", palette="viridis", ci=None) 
plt.title("Current Price by Security (Top Companies)", fontsize=14)  # Setting plot title
plt.xlabel("Security", fontsize=12)
plt.xticks(rotation=90)
plt.show()

Observations¶

  • Stock of Priceline.com Inc is currently priced the highest at ~1275 dollars
  • Amazon.com Inc is second, with a stock price of ~675 dollars
  • Intuitive Surgical Inc. and Regeneron come in next, both with a current stock price of higher than 500 dollars
  • These certainly are outlier values as the average stock price is ~80 dollars
In [31]:
# Looking at the last 10 entries for Current Price and security/company

cpcb = cpc.sort_values('Current Price', ascending=False).tail(10)
cpcb
Out[31]:
Security GICS Sector GICS Sub Industry Current Price
196 Marathon Oil Corp. Energy Oil & Gas Exploration & Production 12.590
153 HP Inc. Information Technology Computer Hardware 11.840
220 Navient Financials Consumer Finance 11.450
165 Huntington Bancshares Financials Banks 11.060
332 Xerox Corp. Information Technology IT Consulting & Other Services 10.630
37 Arconic Inc Industrials Aerospace & Defense 7.399
278 Southwestern Energy Energy Oil & Gas Exploration & Production 7.110
141 Freeport-McMoran Cp & Gld Materials Copper 6.770
142 Frontier Communications Telecommunications Services Integrated Telecommunications Services 4.670
73 Chesapeake Energy Energy Integrated Oil & Gas 4.500
In [32]:
# Creating barplot of current price and company/security (tail)

plt.figure(figsize=(12, 6))
sns.barplot(data=cpcb, x="Security", y="Current Price", palette="viridis", ci=None) 
plt.title("Current Price by Security (Bottom Securities)", fontsize=14)  # Setting plot title
plt.xlabel("Security", fontsize=12)
plt.xticks(rotation=90)
plt.show()

Observations¶

  • Stocks of Chesapeake Energy and Frontier Communications are priced the lowest at ~5 dollars
  • This is followed by Freeport-McMoran Cp & Gld, Southwestern Energy and Arconic Inc at ~7 dollars

2. Security/Company and Price Change¶

In [33]:
# Price Change and security/company

cpsec = data.groupby(["Security"])["Price Change"].mean().reset_index()
cpsec.sort_values('Price Change', ascending=False)
Out[33]:
Security Price Change
134 First Solar Inc 55.052
106 Du Pont (E.I.) 37.490
299 TripAdvisor 34.804
217 Mylan N.V. 33.177
18 Amazon.com Inc 32.268
... ... ...
141 Freeport-McMoran Cp & Gld -31.685
75 Chipotle Mexican Grill -33.131
73 Chesapeake Energy -38.102
278 Southwestern Energy -44.798
183 Kinder Morgan -47.130

340 rows × 2 columns

Observations¶

  • Stocks of First Solar Inc, Du Pont (E.I.), TripAdvisor, Mylan N.V. and Amazon.com Inc saw the maximum percentage increase in stock prices over a 13 week period
  • While stocks of Kinder Morgan, Southwestern Energy, Chesapeake Energy, Chipotle Mexican Grill and Freeport-McMoran Cp & Gld witnessed the maximum percentage decrease in stock prices over the 13 week period

3. Security/Company and Volatility¶

In [34]:
# Volatility and security/company

vs = data.groupby(["Security"])["Volatility"].mean().reset_index()
vs.sort_values('Volatility', ascending=False)
Out[34]:
Security Volatility
278 Southwestern Energy 4.580
73 Chesapeake Energy 4.560
141 Freeport-McMoran Cp & Gld 3.796
329 Wynn Resorts Ltd 3.795
327 Williams Cos. 3.720
... ... ...
262 Republic Services Inc 0.840
306 United Parcel Service 0.826
252 Procter & Gamble 0.806
243 PepsiCo Inc. 0.805
205 McDonald's Corp. 0.733

340 rows × 2 columns

Observations¶

  • Stocks of Southwestern Energy, Chesapeake Energy and Freeport-McMoran Cp & Gld witnesed the maximum volatility in stock prices over a 13 week period.
  • Stocks of McDonald's Corp., PepsiCo Inc and Procter & Gamble witnessed the least volatility in stock prices over a 13 week period.

4. Current Price as per Industry/Economic Sector & Industry-Sub Group¶

In [35]:
# Current Price and industry sector

cps = data.groupby(["GICS Sector"])["Current Price"].mean().sort_values(ascending=False)
cps
Out[35]:
GICS Sector
Health Care                   132.048
Consumer Discretionary        128.095
Real Estate                    90.977
Materials                      76.552
Industrials                    74.412
Consumer Staples               71.973
Information Technology         63.548
Financials                     58.659
Utilities                      52.969
Energy                         46.042
Telecommunications Services    32.964
Name: Current Price, dtype: float64
In [36]:
# Creating barplot of current price and industry sector

plt.figure(figsize=(12, 6))
sns.barplot(data=data, x="GICS Sector", y="Current Price", palette="viridis", ci=None) 
plt.title("Current Price by Industry Sector", fontsize=14)  # Setting plot title
plt.xlabel("GICS Sector", fontsize=12)
plt.xticks(rotation=90)
plt.show()

Observations¶

  • Stocks belonging to the Health Care sector have the highest average current price (~ 132 dollars), followed closely by the Consumer Discretionary sector (~ 128 dollars)
  • The stocks with the lowest average current price belonged to the Telecommunications Services (~ 33 dollars) and Energy (~ 46 dollars) sectors

Current Price as per Sub-Industry Group¶

In [37]:
# Current Price and sub-industry group

cpsb = data.groupby(["GICS Sub Industry"])["Current Price"].mean().sort_values(ascending=False)
cpsb
Out[37]:
GICS Sub Industry
Internet & Direct Marketing Retail                540.097
Life Sciences Tools & Services                    339.130
Biotechnology                                     221.297
Restaurants                                       216.835
Industrial Materials                              202.590
Home Furnishings                                  189.390
Data Processing & Outsourced Services             156.385
Household Appliances                              146.870
Residential REITs                                 137.980
Specialty Chemicals                               137.407
Specialty Stores                                  132.990
Managed Health Care                               129.546
Health Care Equipment                             127.925
Aerospace & Defense                               126.625
Health Care Distributors                          120.520
Construction Materials                            115.775
Office REITs                                      112.980
Industrial Gases                                  102.400
Drug Retail                                        97.770
Health Care Supplies                               97.525
Brewers                                            93.920
Asset Management & Custody Banks                   92.937
Financial Exchanges & Data                         90.600
Health Care Facilities                             90.322
Specialized REITs                                  86.927
Specialty Retail                                   85.500
REITs                                              85.035
Research & Consulting Services                     84.695
Industrial Conglomerates                           81.727
Apparel, Accessories & Luxury Goods                80.610
Multi-line Insurance                               80.540
Broadcasting & Cable TV                            80.145
Personal Products                                  79.410
Household Products                                 78.787
Cable & Satellite                                  78.333
Retail REITs                                       76.408
Hotels, Resorts & Cruise Lines                     73.845
Trucking                                           73.360
Tobacco                                            73.060
Metal & Glass Containers                           72.730
Semiconductors                                     72.092
Diversified Chemicals                              71.792
Soft Drinks                                        71.433
IT Consulting & Other Services                     69.423
Casinos & Gaming                                   69.190
Diversified Financial Services                     69.019
Air Freight & Logistics                            67.783
Internet Software & Services                       67.712
Industrial Machinery                               66.722
Property & Casualty Insurance                      65.922
Railroads                                          65.852
Packaged Foods & Meats                             65.248
Auto Parts & Equipment                             64.480
Construction & Farm Machinery & Heavy Trucks       63.877
Insurance Brokers                                  62.867
Application Software                               60.180
Water Utilities                                    59.750
Oil & Gas Refining & Marketing & Transportation    58.097
Pharmaceuticals                                    55.098
Life & Health Insurance                            55.090
Airlines                                           54.782
Alternative Carriers                               54.360
Banks                                              53.839
Electrical Components & Equipment                  53.590
MultiUtilities                                     53.498
Environmental Services                             53.370
Electric Utilities                                 51.918
Advertising                                        49.470
Integrated Oil & Gas                               48.668
Building Products                                  47.635
Consumer Finance                                   47.442
Leisure Products                                   47.265
Diversified Commercial Services                    47.220
Human Resource & Employment Services               47.140
Motorcycle Manufacturers                           45.390
Housewares & Specialties                           44.080
Oil & Gas Exploration & Production                 42.229
Paper Packaging                                    41.150
Steel                                              40.300
Home Entertainment Software                        38.710
Oil & Gas Equipment & Services                     37.893
Consumer Electronics                               37.170
Agricultural Products                              36.680
Electronic Components                              35.255
Real Estate Services                               34.580
Fertilizers & Agricultural Chemicals               34.200
Homebuilding                                       33.365
Tires & Rubber                                     32.670
Regional Banks                                     31.773
Investment Banking & Brokerage                     31.285
Distributors                                       29.630
Electronic Equipment & Instruments                 28.070
Integrated Telecommunications Services             27.615
Networking Equipment                               27.600
Publishing                                         25.520
Automobile Manufacturers                           24.050
Technology, Hardware, Software and Supplies        20.650
Semiconductor Equipment                            18.670
Gold                                               17.990
Multi-Sector Holdings                              17.390
Thrifts & Mortgage Finance                         16.150
Technology Hardware, Storage & Peripherals         15.200
Computer Hardware                                  11.840
Copper                                              6.770
Name: Current Price, dtype: float64
In [38]:
# Creating barplot of current price and sub-industry group

plt.figure(figsize=(28, 6))
sns.barplot(data=data, x="GICS Sub Industry", y="Current Price", palette="viridis", ci=None) 
plt.title("Current Price by Industry Sector", fontsize=14)  # Setting plot title
plt.xlabel("GICS Sub Industry", fontsize=12)
plt.xticks(rotation=90)
plt.show()

Observations¶

  • With respect to the sub-industry groups, stocks belonging to the Internet & Direct Marketing Retail sub-group (Consumer Discretionary industry sector) have the highest average current price (~ 540 dollars)

  • Next is the Life Sciences Tools & Services sub-group and Biotechnology sub-group at ~ 339 dollars and ~ 221 dollars respectively, both of which belong to the Health Care industry group

  • Restaurants sub-group (Consumer Discretionary industry sector) comes in next with an average current stock price of ~ 217 dollars

  • On the other hand, the stocks of Copper sub-group (Materials industry sector) have the lowest average current stock price of ~7 dollars

  • Next on the lowest average current price are stocks of the Computer Hardware and Technology Hardware, Storage & Peripherals sub-groups at ~ 12 dollars and ~ 15 dollars respectively, both belonging to the Information Technology industry sector

  • Thrifts & Mortgage Finance and Multi-Sector Holdings sub-groups (both belonging to the Financials industry sector) come in next with average current stock prices of ~ 16 dollars and ~ 17 dollars respectively

5. Price Change as per Industry/Economic Sector & Industry-Sub Group¶

In [39]:
# Price change and industry sector

pcs = data.groupby(["GICS Sector"])["Price Change"].mean().sort_values(ascending=False)
pcs
Out[39]:
GICS Sector
Health Care                     9.586
Consumer Staples                8.685
Information Technology          7.217
Telecommunications Services     6.957
Real Estate                     6.206
Consumer Discretionary          5.846
Materials                       5.590
Financials                      3.865
Industrials                     2.833
Utilities                       0.804
Energy                        -10.228
Name: Price Change, dtype: float64
In [40]:
# Creating lineplot of price change and industry sector

plt.figure(figsize=(12, 6))
sns.lineplot(data=data, x="GICS Sector", y="Price Change", palette="viridis") 
plt.title("Price Change by Industry Sector", fontsize=14)  # Setting plot title
plt.xlabel("GICS Sector", fontsize=12)
plt.xticks(rotation=90)
plt.show()

Price Change as per Sub-Industry Group¶

In [41]:
# Price change and sub-industry group

pcsb = data.groupby(["GICS Sub Industry"])["Price Change"].mean().sort_values(ascending=False)
pcsb
Out[41]:
GICS Sub Industry
Casinos & Gaming                                   29.497
Semiconductor Equipment                            26.834
Alternative Carriers                               24.708
Home Entertainment Software                        23.320
Life Sciences Tools & Services                     18.943
Internet & Direct Marketing Retail                 18.789
Advertising                                        18.316
Metal & Glass Containers                           16.536
Health Care Distributors                           16.108
Diversified Chemicals                              15.593
Specialty Chemicals                                15.436
Semiconductors                                     14.437
Investment Banking & Brokerage                     14.060
Publishing                                         13.624
Biotechnology                                      13.497
Building Products                                  13.298
Specialized REITs                                  13.243
Brewers                                            13.129
Data Processing & Outsourced Services              11.823
Leisure Products                                   11.774
Packaged Foods & Meats                             11.604
Pharmaceuticals                                    11.525
Gold                                               10.844
Diversified Commercial Services                    10.819
Personal Products                                  10.661
Tires & Rubber                                     10.446
Soft Drinks                                        10.434
Housewares & Specialties                            9.980
Airlines                                            9.502
Health Care Equipment                               9.331
Tobacco                                             8.607
Water Utilities                                     8.597
Real Estate Services                                8.198
Auto Parts & Equipment                              7.790
Household Products                                  7.780
Residential REITs                                   7.707
Networking Equipment                                7.351
Automobile Manufacturers                            7.340
Property & Casualty Insurance                       7.145
Broadcasting & Cable TV                             7.144
Environmental Services                              7.061
Steel                                               6.586
Managed Health Care                                 6.219
Banks                                               6.083
Internet Software & Services                        5.993
Retail REITs                                        5.906
Hotels, Resorts & Cruise Lines                      5.348
Health Care Supplies                                5.115
Electronic Components                               4.641
Distributors                                        4.441
REITs                                               4.369
Research & Consulting Services                      4.206
Diversified Financial Services                      4.203
Aerospace & Defense                                 4.136
Office REITs                                        4.004
Technology, Hardware, Software and Supplies         3.821
Regional Banks                                      3.672
Home Furnishings                                    3.514
Consumer Electronics                                3.394
Industrial Conglomerates                            3.275
Thrifts & Mortgage Finance                          3.129
Insurance Brokers                                   3.109
Cable & Satellite                                   3.066
Trucking                                            2.961
Application Software                                2.572
Integrated Telecommunications Services              2.519
Specialty Stores                                    2.449
Electrical Components & Equipment                   2.212
Computer Hardware                                   2.162
Multi-line Insurance                                1.898
Life & Health Insurance                             1.854
Health Care Facilities                              1.712
Drug Retail                                         1.327
Specialty Retail                                    1.303
MultiUtilities                                      1.176
Asset Management & Custody Banks                    0.958
Industrial Gases                                    0.294
Electronic Equipment & Instruments                  0.214
Industrial Machinery                                0.010
IT Consulting & Other Services                     -0.157
Electric Utilities                                 -0.187
Household Appliances                               -0.231
Construction & Farm Machinery & Heavy Trucks       -0.605
Consumer Finance                                   -0.838
Homebuilding                                       -1.930
Oil & Gas Refining & Marketing & Transportation    -2.370
Financial Exchanges & Data                         -2.402
Construction Materials                             -2.419
Paper Packaging                                    -2.587
Industrial Materials                               -5.336
Air Freight & Logistics                            -5.418
Railroads                                          -6.407
Integrated Oil & Gas                               -6.955
Restaurants                                        -7.297
Human Resource & Employment Services               -7.659
Oil & Gas Equipment & Services                     -9.991
Fertilizers & Agricultural Chemicals              -10.240
Agricultural Products                             -12.017
Oil & Gas Exploration & Production                -14.242
Multi-Sector Holdings                             -14.293
Apparel, Accessories & Luxury Goods               -16.948
Motorcycle Manufacturers                          -17.247
Technology Hardware, Storage & Peripherals        -17.838
Copper                                            -31.685
Name: Price Change, dtype: float64
In [42]:
# Creating lineplot of price change and sub-industry group

plt.figure(figsize=(28, 6))
sns.lineplot(data=data, x="GICS Sub Industry", y="Price Change", palette="viridis") 
plt.title("Price Change by Sub-Industry Group", fontsize=14)  # Setting plot title
plt.xlabel("GICS Sub Industry", fontsize=14)
plt.xticks(rotation=90)
plt.show()

Observations¶

Q2. The stocks of which economic sector have seen the maximum price increase on average?¶

Industry Sector¶
  • Stocks belonging to the Health Care sector have the highest percentage price change (increase) of ~ 9.5%, followed by the Consumer Staples sector at ~ 8.7%
  • This was followed by Information Technology, Telecommunications Services and Real Estate sectors with percentage price change of ~ 7.2%, ~ 6.9% and ~ 6.2% respectively
  • On the negative end, the Energy sector saw a major price change (decrease) to the tune of ~ -10.2%
Sub-Industry Group¶
  • With respect to the sub-industry groups, stocks belonging to the Casinos & Gaming sub-group witnessed the highest percentage price change (increase) of ~ 29.5%, which is expected from this sector.
  • This was followed by Semiconductor Equipment, Alternative Carriers and Home Entertainment Software sub-sectors at ~ 27%, ~ 25% and ~ 23% respectively.
  • On the negative end, the Copper sub-group (Materials industry group) saw a major price change (decrease) to the tune of ~ -32%, followed by Technology Hardware, Storage & Peripherals, Motorcycle Manufacturers and Apparel, Accessories & Luxury Goods at ~ 18%, ~17.2% and ~17% respectively.

6. Volatility as per Industry/Economic Sector¶

In [43]:
# Creating lineplot of volatility and industry sector

plt.figure(figsize=(12, 6))
sns.lineplot(data=data, x="GICS Sector", y="Volatility", palette="viridis") 
plt.title("Volatility by Industry Sector", fontsize=14)  # Setting plot title
plt.xlabel("GICS Sector", fontsize=12)
plt.xticks(rotation=90)
plt.show()

Volatility as per Sub-Industry Group¶

In [44]:
# Creating lineplot of volatility and sub-industry group

plt.figure(figsize=(28, 6))
sns.lineplot(data=data, x="GICS Sub Industry", y="Volatility", palette="viridis") 
plt.title("Volatility by Sub-Industry Group", fontsize=14)  # Setting plot title
plt.xlabel("GICS Sub Industry", fontsize=14)
plt.xticks(rotation=90)
plt.show()

Observations¶

Industry Sector¶
  • Stocks belonging to the Energy sector have the highest volatility, of ~ 2.6, in the stock prices over the past 13 weeks
  • This was followed by Materials, Information Technology, and Consumer Discretionary at 1.8, 1.7 and 1.6 respectively
  • Utilities and Consumer Staples exhibited the least volatility in stock prices over the past 13 weeks
Sub-Industry Group¶
  • With respect to the sub-industry groups, stocks belonging to the Casinos & Gaming and Copper sub-groups witnessed the highest volatility, of ~ 3.8, in the stock prices over the past 13 weeks
  • This was followed by Technology Hardware, Storage & Peripherals and Oil & Gas Exploration & Production at ~3.4 and ~ 2.9 respectively.
  • Personal Products (Consumer Staples industry group) exhibited the least volatility in stock prices over the past 13 weeks

7. Cash Ratio as per Industry/Economic Sector¶

In [45]:
# Cash ratio and industry sector

crs = data.groupby(["GICS Sector"])["Cash Ratio"].mean().sort_values(ascending=False)
crs
Out[45]:
GICS Sector
Information Technology        149.818
Telecommunications Services   117.000
Health Care                   103.775
Financials                     98.592
Consumer Staples               70.947
Energy                         51.133
Real Estate                    50.111
Consumer Discretionary         49.575
Materials                      41.700
Industrials                    36.189
Utilities                      13.625
Name: Cash Ratio, dtype: float64
In [46]:
# Creating pointplot of cash ratio and industry sector

plt.figure(figsize=(12, 6))
sns.pointplot(data=data, x="GICS Sector", y="Cash Ratio", palette="viridis") 
plt.title("Cash Ratio by Industry Sector", fontsize=14)  # Setting plot title
plt.xlabel("GICS Sector", fontsize=12)
plt.xticks(rotation=90)
plt.show()

Cash Ratio as per Sub-Industry Group¶

In [47]:
# Cash ratio and sub-industry group

crsb = data.groupby(["GICS Sub Industry"])["Cash Ratio"].mean().sort_values(ascending=False)
crsb
Out[47]:
GICS Sub Industry
Biotechnology                                     217.286
Internet Software & Services                      205.417
Gold                                              198.000
Casinos & Gaming                                  198.000
Soft Drinks                                       192.000
Semiconductors                                    189.667
Restaurants                                       174.667
Electronic Components                             169.000
Health Care Distributors                          160.333
Steel                                             147.000
Application Software                              134.000
Integrated Telecommunications Services            131.000
Semiconductor Equipment                           131.000
Personal Products                                 129.000
Consumer Electronics                              121.000
Internet & Direct Marketing Retail                120.750
Diversified Financial Services                    117.000
Health Care Equipment                             108.364
Networking Equipment                              108.000
Oil & Gas Equipment & Services                    107.333
Multi-line Insurance                               99.000
Insurance Brokers                                  99.000
Financial Exchanges & Data                         99.000
Investment Banking & Brokerage                     99.000
Life & Health Insurance                            99.000
Thrifts & Mortgage Finance                         99.000
Regional Banks                                     99.000
Banks                                              99.000
Consumer Finance                                   99.000
Property & Casualty Insurance                      99.000
Multi-Sector Holdings                              81.000
Electronic Equipment & Instruments                 81.000
IT Consulting & Other Services                     77.333
Diversified Commercial Services                    73.000
Leisure Products                                   73.000
Home Entertainment Software                        70.000
Managed Health Care                                70.000
Asset Management & Custody Banks                   67.000
Construction Materials                             63.500
Pharmaceuticals                                    62.500
Alternative Carriers                               61.000
REITs                                              57.214
Oil & Gas Exploration & Production                 52.812
Construction & Farm Machinery & Heavy Trucks       52.333
Air Freight & Logistics                            50.000
Agricultural Products                              49.000
Airlines                                           47.800
Industrial Machinery                               47.800
Residential REITs                                  47.000
Retail REITs                                       47.000
Office REITs                                       47.000
Railroads                                          46.000
Technology Hardware, Storage & Peripherals         45.000
Fertilizers & Agricultural Chemicals               43.500
Building Products                                  43.000
Integrated Oil & Gas                               40.400
Specialized REITs                                  39.000
Automobile Manufacturers                           38.000
Packaged Foods & Meats                             37.667
Electrical Components & Equipment                  37.000
Brewers                                            35.000
Technology, Hardware, Software and Supplies        34.000
Human Resource & Employment Services               34.000
Health Care Supplies                               31.500
Diversified Chemicals                              31.200
Motorcycle Manufacturers                           31.000
Health Care Facilities                             30.200
Tires & Rubber                                     30.000
Industrial Conglomerates                           28.929
Oil & Gas Refining & Marketing & Transportation    27.500
Tobacco                                            27.500
Apparel, Accessories & Luxury Goods                27.000
Household Products                                 25.000
Homebuilding                                       25.000
Data Processing & Outsourced Services              25.000
Broadcasting & Cable TV                            24.500
Paper Packaging                                    23.000
Aerospace & Defense                                22.750
Research & Consulting Services                     21.000
Publishing                                         21.000
Auto Parts & Equipment                             19.500
Advertising                                        19.000
Life Sciences Tools & Services                     18.000
Computer Hardware                                  18.000
Cable & Satellite                                  16.667
MultiUtilities                                     16.273
Industrial Materials                               16.000
Housewares & Specialties                           14.000
Specialty Chemicals                                12.500
Real Estate Services                               12.000
Electric Utilities                                 12.000
Distributors                                       12.000
Drug Retail                                        11.000
Hotels, Resorts & Cruise Lines                     10.250
Metal & Glass Containers                           10.000
Household Appliances                               10.000
Specialty Retail                                   10.000
Industrial Gases                                    8.000
Copper                                              5.000
Water Utilities                                     4.000
Specialty Stores                                    3.333
Home Furnishings                                    3.000
Environmental Services                              2.000
Trucking                                            1.000
Name: Cash Ratio, dtype: float64
In [48]:
# Creating pointplot of cash ratio and sub-industry group

plt.figure(figsize=(28, 6))
sns.pointplot(data=data, x="GICS Sub Industry", y="Cash Ratio", palette="viridis") 
plt.title("Cash Ratio by Sub-Industry Group", fontsize=14)  # Setting plot title
plt.xlabel("GICS Sub Industry", fontsize=14)
plt.xticks(rotation=90)
plt.show()

Observations¶

Q3. How does the average cash ratio vary across economic sectors?¶

Industry Sector¶
  • Stocks belonging to the Information Technology sector have the highest cash ratio of ~ 150. This suggests that companies belonging to this sector have the ability to cover their short-term obligations using only cash and cash equivalents
  • This was followed by Telecommunications Services and Health Care at ~ 117 and ~ 104 respectively
  • Utilities and Industrials sectors have the lowest cash ratios of 14 and 36 respectively, which indicates that majority of companies in these sectors have low liquidity
Sub-Industry Group¶
  • With respect to the sub-industry groups, stocks belonging to the Biotachnology sub-group have the highest cash ratio of ~ 217, followed closely by Internet Software & Services at ~ 205
  • This was followed by Casinos & Gaming and Gold sub-groups at ~ 198 each and Soft Drinks at ~ 192
  • Trucking sub-group (Industrials sector) had the lowest cash ratio of ~ 1, Environmental Services stood at ~ 2, whereas it was ~ 3 each for Specality Stores and Home Furnishings sub-groups

8. Net Income as per Industry/Economic Sector¶

In [49]:
# Creating barplot of net income and industry sector

plt.figure(figsize=(12, 6))
sns.barplot(data=data, x="GICS Sector", y="Net Income", palette="viridis", ci=None) 
plt.title("Net Income by Industry Sector", fontsize=14)  # Setting plot title
plt.xlabel("GICS Sector", fontsize=12)
plt.xticks(rotation=90)
plt.show()

Net Income as per Sub-Industry Group¶

In [50]:
# Creating barplot of net income and sub-industry group

plt.figure(figsize=(28, 6))
sns.barplot(data=data, x="GICS Sub Industry", y="Net Income", palette="viridis", ci=None) 
plt.title("Net Income by Sub-Industry Group", fontsize=14)  # Setting plot title
plt.xlabel("GICS Sub Industry", fontsize=12)
plt.xticks(rotation=90)
plt.show()

Observations¶

Industry Sector¶
  • Stocks belonging to the Telecommunications Services sector have the highest net income, followed by Financials, Consumer Staples and Health Care sectors
  • The Energy sector exhibited negative net income, signifying potential losses
Sub-Industry Group¶
  • With respect to the sub-industry groups, stocks belonging to the Banks sub-group have the highest net income, followed by Automobile Manufacturers and Integrated Telecommunications Services
  • Copper, Oil & Gas Exploration & Production and Oil & Gas Equipment & Services are sub-groups that witnessed net losses
  • It is interesting to note that net losses in Copper are the highest amongst all the sub-industry groups, however, the Materials sector to which it belongs has a positive net income, signifying good performance of other sub-groups in the materials sector

9. Earnings Per Share as per Industry/Economic Sector¶

In [51]:
# Creating lineplot of Earnings Per Share and industry sector

plt.figure(figsize=(12, 6))
sns.lineplot(data=data, x="GICS Sector", y="Earnings Per Share", palette="viridis") 
plt.title("Earnings Per Share by Industry Sector", fontsize=14)  # Setting plot title
plt.xlabel("GICS Sector", fontsize=12)
plt.xticks(rotation=90)
plt.show()

Earnings Per Share as per Sub-Industry Group¶

In [52]:
# Creating lineplot of Earnings Per Share and sub-industry group

plt.figure(figsize=(28, 6))
sns.lineplot(data=data, x="GICS Sub Industry", y="Earnings Per Share", palette="viridis") 
plt.title("Earnings Per Share by Sub-Industry Group", fontsize=16)  # Setting plot title
plt.xlabel("GICS Sub Industry", fontsize=14)
plt.xticks(rotation=90)
plt.show()

Observations¶

Industry Sector¶
  • Stocks belonging to the Health Care and Consumer Discretionary sectors have the highest EPS, i.e., the stocks of companies in this sector are profitable
  • The Energy sector exhibited negative EPS, in line with the negative net income as noted above
Sub-Industry Group¶
  • With respect to the sub-industry groups, stocks with negative net incomes (as noted above in point 8) also exbihit a negative EPS

10. P/E Ratio as per Industry/Economic Sector¶

In [53]:
# P/E ratio and industry sector

pes = data.groupby(["GICS Sector"])["P/E Ratio"].mean().sort_values(ascending=False)
pes
Out[53]:
GICS Sector
Energy                        72.898
Information Technology        43.783
Real Estate                   43.066
Health Care                   41.135
Consumer Discretionary        35.212
Consumer Staples              25.521
Materials                     24.585
Utilities                     18.719
Industrials                   18.259
Financials                    16.023
Telecommunications Services   12.223
Name: P/E Ratio, dtype: float64
In [54]:
# Creating barplot of P/E ratio and industry sector

plt.figure(figsize=(12, 6))
sns.boxplot(data=data, x="GICS Sector", y="P/E Ratio") 
plt.title("P/E Ratio by Industry Sector", fontsize=14)  # Setting plot title
plt.xlabel("GICS Sector", fontsize=12)
plt.xticks(rotation=90)
plt.show()

P/E Ratio as per Sub-Industry Group¶

In [55]:
# P/E ratio and industry sector

pesb = data.groupby(["GICS Sub Industry"])["P/E Ratio"].mean().sort_values(ascending=False)
pesb
Out[55]:
GICS Sub Industry
Internet & Direct Marketing Retail                159.111
Office REITs                                      110.765
Oil & Gas Exploration & Production                 93.518
Oil & Gas Equipment & Services                     93.089
Application Software                               74.556
Biotechnology                                      73.319
Home Furnishings                                   73.124
Internet Software & Services                       61.536
Specialized REITs                                  57.526
Brewers                                            48.412
Oil & Gas Refining & Marketing & Transportation    44.889
Construction Materials                             44.450
Health Care Equipment                              43.766
Semiconductors                                     43.431
Retail REITs                                       42.248
Gold                                               41.837
Health Care Distributors                           41.387
Multi-line Insurance                               38.721
REITs                                              38.444
Residential REITs                                  37.783
Pharmaceuticals                                    36.971
Steel                                              36.306
Casinos & Gaming                                   35.850
Metal & Glass Containers                           35.478
Household Products                                 34.298
Health Care Supplies                               33.994
Housewares & Specialties                           33.908
Data Processing & Outsourced Services              33.620
Environmental Services                             32.151
Home Entertainment Software                        31.992
Investment Banking & Brokerage                     31.940
Building Products                                  29.978
Tires & Rubber                                     28.658
Integrated Oil & Gas                               28.408
Specialty Retail                                   28.218
Research & Consulting Services                     27.041
Life Sciences Tools & Services                     26.598
Computer Hardware                                  25.310
Technology Hardware, Storage & Peripherals         25.310
Soft Drinks                                        25.268
Specialty Chemicals                                24.578
Restaurants                                        24.518
Financial Exchanges & Data                         24.420
Personal Products                                  24.070
Multi-Sector Holdings                              23.500
Health Care Facilities                             23.150
Hotels, Resorts & Cruise Lines                     23.088
Copper                                             22.812
Water Utilities                                    22.462
Packaged Foods & Meats                             22.253
Paper Packaging                                    22.059
Leisure Products                                   21.908
Electrical Components & Equipment                  21.785
Real Estate Services                               21.085
Drug Retail                                        20.981
Tobacco                                            20.845
Distributors                                       20.820
Cable & Satellite                                  20.820
Apparel, Accessories & Luxury Goods                20.820
Specialty Stores                                   20.281
Trucking                                           19.881
Electronic Components                              19.534
IT Consulting & Other Services                     19.333
Advertising                                        19.026
Industrial Gases                                   18.998
Electric Utilities                                 18.996
Insurance Brokers                                  18.968
Managed Health Care                                18.888
Thrifts & Mortgage Finance                         18.779
Industrial Conglomerates                           18.370
MultiUtilities                                     18.077
Air Freight & Logistics                            18.047
Regional Banks                                     17.974
Aerospace & Defense                                17.871
Human Resource & Employment Services               17.331
Industrial Materials                               17.330
Networking Equipment                               17.037
Diversified Commercial Services                    16.568
Semiconductor Equipment                            16.522
Broadcasting & Cable TV                            16.513
Auto Parts & Equipment                             16.385
Diversified Chemicals                              16.263
Electronic Equipment & Instruments                 16.225
Diversified Financial Services                     15.553
Consumer Electronics                               15.552
Asset Management & Custody Banks                   15.478
Property & Casualty Insurance                      15.313
Industrial Machinery                               15.156
Railroads                                          15.147
Construction & Farm Machinery & Heavy Trucks       14.842
Household Appliances                               14.761
Integrated Telecommunications Services             13.879
Banks                                              13.097
Homebuilding                                       12.776
Publishing                                         12.510
Agricultural Products                              12.268
Motorcycle Manufacturers                           12.235
Fertilizers & Agricultural Chemicals               11.815
Life & Health Insurance                            11.407
Technology, Hardware, Software and Supplies        10.123
Consumer Finance                                    9.305
Airlines                                            8.161
Automobile Manufacturers                            6.571
Alternative Carriers                                5.598
Name: P/E Ratio, dtype: float64
In [56]:
# Creating barplot of P/E ratio and sub-industry group

plt.figure(figsize=(28, 6))
sns.barplot(data=data, x="GICS Sub Industry", y="P/E Ratio", ci=None) 
plt.title("P/E Ratio by Sub-Industry Group", fontsize=16)  # Setting plot title
plt.xlabel("GICS Sub Industry", fontsize=14)
plt.xticks(rotation=90)
plt.show()

Observations¶

Q4. P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?¶

Industry Sector¶
  • Stocks belonging to the Energy sector have the highest P/E ratio and also the highest variance in P/E ratios, as indicated by the whiskers of the boxplot. P/E ratio is a measure of valuation
  • This suggests an investors' behaviour of giving more precedance to this sector, i.e., they are willing to invest more in a single share of a company/security in this sector per dollar of its earnings as opposed to securities in other industry sectors
  • The P/E ratios of Telecommunication Services and Financials are the lowest, suggesting lower valuation
Sub-Industry Group¶
  • With respect to the sub-industry groups, stocks belonging to the Internet & Direct Marketing Retail (belonging to Consumer Discretionary sector) have the highest P/E ratio
  • This was followed by Office REITs, Oil & Gas Exploration & Production and Oil & Gas Equipment & Services sub-industry groups, again indicating investor willingness to invest in shares of these sub-groups than in shares of other sub-groups

11. P/B Ratio as per Industry/Economic Sector¶

In [57]:
# Creating barplot of P/B ratio and industry sector

plt.figure(figsize=(12, 6))
sns.barplot(data=data, x="GICS Sector", y="P/B Ratio", palette="viridis", ci=None) 
plt.title("P/B Ratio by Industry Sector", fontsize=14)  # Setting plot title
plt.xlabel("GICS Sector", fontsize=12)
plt.xticks(rotation=90)
plt.show()

P/B Ratio as per Sub-Industry Group¶

In [58]:
# Creating barplot of P/B ratio and sub-industry group

plt.figure(figsize=(28, 6))
sns.barplot(data=data, x="GICS Sub Industry", y="P/B Ratio", palette="viridis", ci=None) 
plt.title("P/B Ratio by Sub-Industry Group", fontsize=16)  # Setting plot title
plt.xlabel("GICS Sub Industry", fontsize=14)
plt.xticks(rotation=90)
plt.show()

Observations¶

Industry Sector¶
  • The majority of sectors, including Telecommunications Services, Consumer Discretionary, Consumer Staples and Financials exhibited a negative P/B ratio, i.e., a negative ratio of the company's stock price per share by its book value per share, which indicates undervalued stock
  • Sectors including Information Technology, Energy and Materials have a positive P/B ratio, suggesting that the stock price is trading at a premium to the company's book value.
Sub-Industry Group¶
  • With respect to the sub-industry groups, stocks belonging to the Data Processing & Outsourced Services sub-group had the highest P/B ratio, while Cable & Satellite, Financial Exchanges & Data and Household Appliances have negative P/B ratios.

Let's check for correlations¶

In [59]:
sns.pairplot(data=data[num_col], diag_kind="kde")
plt.show()
In [60]:
plt.figure(figsize=(15, 7))
sns.heatmap(data[num_col].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="viridis_r")
plt.show()

Observations¶

  • As is visible form the kde plots, while most of the variable are normally distributed, the distributions for P/E Ratio and Earnings Per Share appear to be bimodal and that of Volatility is trimodal
  • Earnings Per Share has a positive correlation with Current Price (0.48) & Net Income (0.56)
  • Earnings Per Share has a negative correlation with Volatility (-0.38) & ROE (-0.41)
  • Net Income has a positive correlation with Estimated Shares Outstanding (0.59)
  • Volatility has a negative correlation with Price Change (-0.41), signifying that the more volatile the stocks, the more negative the Price Change

Data Preprocessing¶

  • Duplicate value check
  • Missing value treatment
  • Outlier check
  • Scaling

Checking for duplicates and missing values¶

In [61]:
# checking for duplicate values
data.duplicated().sum()
Out[61]:
0
In [62]:
# checking for missing values in the data
data.isnull().sum()
Out[62]:
Ticker Symbol                   0
Security                        0
GICS Sector                     0
GICS Sub Industry               0
Current Price                   0
Price Change                    0
Volatility                      0
ROE                             0
Cash Ratio                      0
Net Cash Flow                   0
Net Income                      0
Earnings Per Share              0
Estimated Shares Outstanding    0
P/E Ratio                       0
P/B Ratio                       0
dtype: int64
  • These are no duplicated or missing values in the dataset.

Outlier Check¶

  • Let's plot the boxplots of all numerical columns to check for outliers.
In [63]:
plt.figure(figsize=(15, 12))

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

Observations¶

  • Outliers, although present, appear to be actual data points. Therefore, we are not treating them

Scaling¶

  • Let's scale the data before we proceed with clustering
In [64]:
# Scaling the data set before clustering
scaler = StandardScaler()
subset = data[num_col].copy()
subset_scaled = scaler.fit_transform(subset)
In [65]:
# creating a dataframe of the scaled columns
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
In [66]:
# Checking for scaled values
subset_scaled_df.head(10)
Out[66]:
Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
0 -0.393 0.494 0.273 0.990 -0.211 -0.339 1.554 1.309 0.108 -0.652 -0.507
1 -0.221 0.355 1.137 0.938 0.077 -0.002 0.928 0.057 1.250 -0.312 -0.504
2 -0.367 0.602 -0.427 -0.193 -0.033 0.454 0.744 0.025 1.098 -0.392 0.095
3 0.134 0.826 -0.285 -0.317 1.218 -0.152 -0.220 -0.231 -0.092 0.947 0.424
4 -0.261 -0.493 0.296 -0.266 2.237 0.134 -0.203 -0.375 1.978 3.293 0.199
5 -0.451 -1.343 -0.016 -0.307 -0.233 -0.126 0.090 0.032 0.049 -0.459 0.661
6 1.999 0.176 -0.692 -0.100 -0.499 0.018 -0.228 0.932 -0.604 -0.036 9.378
7 -0.384 -0.159 -0.680 -0.317 -0.620 0.119 -0.218 -0.027 -0.394 -0.361 0.072
8 -0.231 -0.142 -0.774 -0.297 -0.676 -0.021 0.142 0.054 -0.184 -0.320 -0.094
9 -0.214 -0.088 -0.808 -0.266 0.321 -0.187 0.264 0.472 -0.173 -0.506 -0.012

Observations¶

  • It has be seem that post the scalar transformation, all variables/attributes are on the same scale, with a mean of 0 and standard deviation of 1
  • The relationships between the attributes have, however, been maintained

EDA¶

  • It is a good idea to explore the data once again after manipulating it

We won't need Ticker Symbol and Security for analysis, so let's drop these columns¶

In [67]:
data.drop(["Ticker Symbol"], axis=1, inplace=True)
In [68]:
data.describe(include="all").T
Out[68]:
count unique top freq mean std min 25% 50% 75% max
Security 340 340 American Airlines Group 1 NaN NaN NaN NaN NaN NaN NaN
GICS Sector 340 11 Industrials 53 NaN NaN NaN NaN NaN NaN NaN
GICS Sub Industry 340 104 Oil & Gas Exploration & Production 16 NaN NaN NaN NaN NaN NaN NaN
Current Price 340.000 NaN NaN NaN 80.862 98.055 4.500 38.555 59.705 92.880 1274.950
Price Change 340.000 NaN NaN NaN 4.078 12.006 -47.130 -0.939 4.820 10.695 55.052
Volatility 340.000 NaN NaN NaN 1.526 0.592 0.733 1.135 1.386 1.696 4.580
ROE 340.000 NaN NaN NaN 39.597 96.548 1.000 9.750 15.000 27.000 917.000
Cash Ratio 340.000 NaN NaN NaN 70.024 90.421 0.000 18.000 47.000 99.000 958.000
Net Cash Flow 340.000 NaN NaN NaN 55537620.588 1946365312.176 -11208000000.000 -193906500.000 2098000.000 169810750.000 20764000000.000
Net Income 340.000 NaN NaN NaN 1494384602.941 3940150279.328 -23528000000.000 352301250.000 707336000.000 1899000000.000 24442000000.000
Earnings Per Share 340.000 NaN NaN NaN 2.777 6.588 -61.200 1.558 2.895 4.620 50.090
Estimated Shares Outstanding 340.000 NaN NaN NaN 577028337.754 845849595.418 27672156.860 158848216.100 309675137.800 573117457.325 6159292035.000
P/E Ratio 340.000 NaN NaN NaN 32.613 44.349 2.935 15.045 20.820 31.765 528.039
P/B Ratio 340.000 NaN NaN NaN -1.718 13.967 -76.119 -4.352 -1.067 3.917 129.065

K-means Clustering¶

Checking Elbow Plot¶

In [69]:
k_means_df = subset_scaled_df.copy()
In [70]:
%%time 

clusters = range(1, 15)
meanDistortions = []

for k in clusters:
    model = KMeans(n_clusters=k, random_state=1)
    model.fit(subset_scaled_df)
    prediction = model.predict(k_means_df)
    distortion = (
        sum(np.min(cdist(k_means_df, model.cluster_centers_, "euclidean"), axis=1))
        / k_means_df.shape[0]
    )

    meanDistortions.append(distortion)

    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)

plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()
Number of Clusters: 1 	Average Distortion: 2.5425069919221697
Number of Clusters: 2 	Average Distortion: 2.382318498894466
Number of Clusters: 3 	Average Distortion: 2.2692367155390745
Number of Clusters: 4 	Average Distortion: 2.1745559827866363
Number of Clusters: 5 	Average Distortion: 2.128799332840716
Number of Clusters: 6 	Average Distortion: 2.080400099226289
Number of Clusters: 7 	Average Distortion: 2.0289794220177395
Number of Clusters: 8 	Average Distortion: 1.964144163389972
Number of Clusters: 9 	Average Distortion: 1.9221492045198068
Number of Clusters: 10 	Average Distortion: 1.8513913649973124
Number of Clusters: 11 	Average Distortion: 1.8024134734578485
Number of Clusters: 12 	Average Distortion: 1.7900931879652673
Number of Clusters: 13 	Average Distortion: 1.7417609203336912
Number of Clusters: 14 	Average Distortion: 1.673559857259703
CPU times: user 1.01 s, sys: 672 ms, total: 1.68 s
Wall time: 463 ms

Observations¶

  • As per the above elbow curve, it seems that 4 or 5 would be an appropriate value of k

Let's check the silhouette scores¶

In [71]:
# checking silhoutte score
sil_score = []
cluster_list = range(2, 15)
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    preds = clusterer.fit_predict((k_means_df))
    score = silhouette_score(k_means_df, preds)
    sil_score.append(score)
    print("For n_clusters = {}, the silhouette score is {})".format(n_clusters, score))

plt.plot(cluster_list, sil_score)
plt.show()
For n_clusters = 2, the silhouette score is 0.43969639509980457)
For n_clusters = 3, the silhouette score is 0.4644405674779404)
For n_clusters = 4, the silhouette score is 0.4577225970476733)
For n_clusters = 5, the silhouette score is 0.43228336443659804)
For n_clusters = 6, the silhouette score is 0.4005422737213617)
For n_clusters = 7, the silhouette score is 0.3976335364987305)
For n_clusters = 8, the silhouette score is 0.40278401969450467)
For n_clusters = 9, the silhouette score is 0.3778585981433699)
For n_clusters = 10, the silhouette score is 0.13458938329968687)
For n_clusters = 11, the silhouette score is 0.1421832155528444)
For n_clusters = 12, the silhouette score is 0.2044669621527429)
For n_clusters = 13, the silhouette score is 0.23424874810104204)
For n_clusters = 14, the silhouette score is 0.12102526472829901)

Observations¶

  • From the silhouette scores as well, it seems that 4 or 5 would be a good number of clusters
In [72]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
Out[72]:
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [73]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
Out[73]:
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 5 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>

Let's build the final model with k=5 as it has a high Silhouette Score¶

Creating Final Model¶

In [74]:
%%time 

# final K-means model with k=5
kmeans = KMeans(n_clusters=5, random_state=1)
kmeans.fit(k_means_df)
CPU times: user 116 ms, sys: 499 ms, total: 615 ms
Wall time: 71.9 ms
Out[74]:
KMeans(n_clusters=5, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KMeans(n_clusters=5, random_state=1)
In [75]:
# creating a copy of the original data
df1 = data.copy()

# adding kmeans cluster labels to the original and scaled dataframes
k_means_df["K_means_segments"] = kmeans.labels_
df1["K_means_segments"] = kmeans.labels_

Cluster Profiling¶

In [76]:
km_cluster_profile = df1.groupby("K_means_segments").mean()
In [77]:
km_cluster_profile["count_in_each_segment"] = (
    df1.groupby("K_means_segments")["Security"].count().values)
In [78]:
km_cluster_profile.style.highlight_max(color="yellow", axis=0)
Out[78]:
  Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio count_in_each_segment
K_means_segments                        
0 65.174668 -11.542247 2.690220 37.300000 65.366667 195008366.666667 -1677736033.333333 -4.401667 544473664.718000 113.488924 1.424161 30
1 72.738269 5.179897 1.380738 34.825455 53.138182 -10147287.272727 1488641570.909091 3.636164 437961614.918582 23.680917 -3.395254 275
2 233.251108 13.682869 1.719008 29.333333 296.523810 1398716380.952381 1835686380.952381 7.126190 508721791.962857 37.805996 16.758218 21
3 50.517273 5.747586 1.130399 31.090909 75.909091 -1072272727.272727 14833090909.090910 4.154545 4298826628.727273 14.803577 -4.552119 11
4 26.990000 -14.060688 3.296307 603.000000 57.333333 -585000000.000000 -17555666666.666668 -39.726667 481910081.666667 71.528835 1.638633 3
In [79]:
# Printing the companies in each cluster
for cl in df1["K_means_segments"].unique():
    print("In cluster {}, the following companies are present:".format(cl))
    print(df1[df1["K_means_segments"] == cl]["Security"].unique())
    print()
In cluster 1, the following companies are present:
['American Airlines Group' 'AbbVie' 'Abbott Laboratories'
 'Adobe Systems Inc' 'Archer-Daniels-Midland Co' 'Ameren Corp'
 'American Electric Power' 'AFLAC Inc'
 'American International Group, Inc.' 'Apartment Investment & Mgmt'
 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc'
 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Allegion'
 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc'
 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc'
 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc'
 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom'
 'American Water Works Company Inc' 'American Express Co' 'Boeing Company'
 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.'
 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb'
 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Caterpillar Inc.'
 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.'
 'Carnival Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group'
 'Church & Dwight' 'C. H. Robinson Worldwide' 'Charter Communications'
 'CIGNA Corp.' 'Cincinnati Financial' 'Colgate-Palmolive' 'Comerica Inc.'
 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation'
 'CenterPoint Energy' 'Capital One Financial' 'The Cooper Companies'
 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions'
 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Dominion Resources'
 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.'
 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.'
 'The Walt Disney Company' 'Discovery Communications-A'
 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust'
 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy'
 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison'
 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equity Residential'
 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade'
 'Eaton Corporation' 'Entergy Corp.' 'Exelon Corp.' "Expeditors Int'l"
 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co'
 'Fortune Brands Home & Security' 'FirstEnergy Corp'
 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems'
 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation'
 'Federal Realty Investment Trust' 'General Dynamics'
 'General Growth Properties Inc.' 'Corning Inc.' 'General Motors'
 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber'
 'Grainger (W.W.) Inc.' 'Hasbro Inc.' 'Huntington Bancshares'
 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.'
 'Harley-Davidson' "Honeywell Int'l Inc." 'HP Inc.' 'Hormel Foods Corp.'
 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company'
 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories'
 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group'
 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.'
 'J. B. Hunt Transport Services' 'Jacobs Engineering Group'
 'Juniper Networks' 'Kimco Realty' 'Kimberly-Clark' 'Kansas City Southern'
 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding'
 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.'
 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.'
 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell'
 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l."
 'Masco Corp.' 'Mattel Inc.' "Moody's Corp" 'Mondelez International'
 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.'
 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company'
 'Altria Group Inc' 'Marathon Petroleum' 'Merck & Co.' 'M&T Bank Corp.'
 'Mettler Toledo' 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group'
 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Nielsen Holdings'
 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.'
 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group'
 "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes'
 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.'
 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.'
 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services'
 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.'
 'Prudential Financial' 'Phillips 66' 'Praxair Inc.' 'PayPal'
 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Robert Half International'
 'Roper Industries' 'Republic Services Inc' 'SCANA Corp'
 'Charles Schwab Corporation' 'Sealed Air' 'Sherwin-Williams'
 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.'
 'Simon Property Group Inc' 'S&P Global, Inc.' 'Stericycle Inc'
 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.'
 'Skyworks Solutions' 'Synchrony Financial' 'Stryker Corp.'
 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Torchmark Corp.'
 'Thermo Fisher Scientific' 'The Travelers Companies Inc.'
 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.'
 'Total System Services' 'Texas Instruments' 'Under Armour'
 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.'
 'United Health Group Inc.' 'Unum Group' 'Union Pacific'
 'United Parcel Service' 'United Technologies' 'Varian Medical Systems'
 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust'
 'Verisk Analytics' 'Verisign Inc.' 'Ventas Inc' 'Wec Energy Group Inc'
 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co'
 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Xcel Energy Inc' 'XL Capital'
 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yum! Brands Inc'
 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis']

In cluster 0, the following companies are present:
['Analog Devices, Inc.' 'Alexion Pharmaceuticals' 'Amazon.com Inc'
 'Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Cabot Oil & Gas'
 'Concho Resources' 'EOG Resources' 'EQT Corporation'
 'Freeport-McMoran Cp & Gld' 'Halliburton Co.' 'Hess Corporation'
 'Hewlett Packard Enterprise' 'Kinder Morgan' 'The Mosaic Company'
 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc' 'Netflix Inc.'
 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK'
 'Occidental Petroleum' 'Quanta Services Inc.' 'Range Resources Corp.'
 'Spectra Energy Corp.' 'Southwestern Energy' 'Teradata Corp.'
 'Williams Cos.' 'Cimarex Energy']

In cluster 2, the following companies are present:
['Alliance Data Systems' 'Amgen Inc' 'Bank of America Corp'
 'BIOGEN IDEC Inc.' 'Celgene Corp.' 'Chipotle Mexican Grill' 'Equinix'
 'Edwards Lifesciences' 'Facebook' 'First Solar Inc'
 'Frontier Communications' 'Intuitive Surgical Inc.' "McDonald's Corp."
 'Monster Beverage' 'Priceline.com Inc' 'Regeneron' 'TripAdvisor'
 'Vertex Pharmaceuticals Inc' 'Waters Corporation' 'Wynn Resorts Ltd'
 'Yahoo Inc.']

In cluster 4, the following companies are present:
['Apache Corporation' 'Chesapeake Energy' 'Devon Energy Corp.']

In cluster 3, the following companies are present:
['Citigroup Inc.' 'Ford Motor' 'Gilead Sciences' 'Intel Corp.'
 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc'
 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.']

In [80]:
df1.groupby(["K_means_segments", "GICS Sector"])['Security'].count()
Out[80]:
K_means_segments  GICS Sector                
0                 Consumer Discretionary          1
                  Energy                         21
                  Health Care                     1
                  Industrials                     1
                  Information Technology          4
                  Materials                       2
1                 Consumer Discretionary         33
                  Consumer Staples               17
                  Energy                          5
                  Financials                     45
                  Health Care                    29
                  Industrials                    52
                  Information Technology         24
                  Materials                      18
                  Real Estate                    26
                  Telecommunications Services     2
                  Utilities                      24
2                 Consumer Discretionary          5
                  Consumer Staples                1
                  Financials                      1
                  Health Care                     8
                  Information Technology          4
                  Real Estate                     1
                  Telecommunications Services     1
3                 Consumer Discretionary          1
                  Consumer Staples                1
                  Energy                          1
                  Financials                      3
                  Health Care                     2
                  Information Technology          1
                  Telecommunications Services     2
4                 Energy                          3
Name: Security, dtype: int64
In [81]:
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=k_means_df, x="K_means_segments", y=variable)

plt.tight_layout(pad=2.0)
In [82]:
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of original variables for each cluster")

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=df1, x="K_means_segments", y=variable)

plt.tight_layout(pad=2.0)

Insights¶

  • It can be seen that Cluster 0 has 30 companies/securities, Cluster 1 has 275 companies, Cluster 2 has 21 companies, Cluster 3 has 11 companies, while Cluster 4 has 3 companies

Cluster 0¶

  • Out of the 30 companies, 21 belonged to the Energy sector
  • With respect to Current Price, the stocks in this cluster were priced under 200 dollars, with a mean price of ~ 65 dollars
  • Majority of stocks exhibited negative price changes, with moderate to high volatility (average volatility is ~ 2.7)
  • Stocks had a low ROE (average ~ 37) and low cash ratio (average ~ 65)
  • Stocks in this cluster have a positive average net cash flow, suggesting cash inflows > cash outflows. However, the average net income as well as the EPS are negative
  • P/E ratio is high (average ~ 113), which indicates the willingness of investors to invest in stocks in this cluster
  • However, the average P/B ratio is low, suggesting possible undervalued stocks

Cluster 1¶

  • This cluster has stocks drawn from all 11 sectors present in the dataset
  • Out of 275 companies, the majority of stocks belonged to the Industrials, Financials and Consumer Discretionary sectors
  • With respect to Current Price, the majority of stocks in this cluster were priced under 200 dollars, with a mean price of ~ 73 dollars
  • Majority of stocks exhibited positive price changes, with low volatility (average volatility is ~ 1.3)
  • Stocks had a low ROE (average ~ 35) and low cash ratio (average ~ 53)
  • Stocks in this cluster have a negative average net cash flow, suggesting cash outflows > cash inflows. However, the average net income as well as the EPS are positive
  • Average P/E ratio is low but positive (~ 24), while the average P/B ratio is negative, suggesting undervalued stocks

Cluster 2¶

  • Out of the 21 securities, the majority came from Health Care, Consumer Discretionary and Information Technology sectors
  • This cluster has the highest current stock prices, with an average current price of ~ 233 dollars
  • This cluster also has the maximum positive price changes as compared to other clusters (average ~ 14%), with low volatility (average ~1.7)
  • Stocks had a low ROE (average ~ 29) but the highest cash ratio (average ~ 296) among all clusters, indicating the ability of companies to cover their short-term obligations using only cash and cash equivalents
  • Stocks in this cluster have a high and positive average net cash flow, suggesting cash inflows > cash outflows. Also, the average net income is positive
  • Stocks in this cluster have the highest average EPS (~ 7)
  • Average P/E ratio is low but positive and greater than cluster 1 (~ 38)
  • The average P/B ratio is the highest of all clusters (~ 17), suggesting that the stock price for stocks in this cluster is trading at a premium to the company's book value

Cluster 3¶

  • Out of the 11 securities, the majority belong to the Financials, Health Care and Telecommunications Services sectors
  • With respect to Current Price, the majority of stocks in this cluster were priced under 100 dollars, with a mean price of ~ 50 dollars
  • The stocks exhibited positive price changes, with low volatility (average volatility is ~ 1.1)
  • Stocks had a low ROE (average ~ 31) and an average cash ratio of ~ 75, which is second best after cluster 2
  • Stocks in this cluster have a negative average net cash flow, suggesting cash outflows > cash inflows
  • However, the average net income is the highest of all clusters as well as the EPS are positive (~ 4)
  • Average P/E ratio is low but positive (~ 15), while the average P/B ratio is negative (~ -4.5), suggesting undervalued stocks

Cluster 4¶

  • Out of the 3 securities, all the stocks belong to the Energy sector
  • These stocks have the lowest average current price (~ 27 dollars) and negative average price changes (~ 14) and high average volatility (~ 3.3), as was also noted in EDA
  • However, these stocks have the highest average ROE (~ 603)
  • The average net cash flow, net income and earnings per share are negative
  • As noted in EDA too, the average P/E ratio is high for this cluster (energy sector) (~ 72), and the average P/B is positive (~ 1.6), indicative of no undervaluation in stock price, hence being attractive investment opportunities for investors

Hierarchical Clustering¶

In [83]:
hc_df = subset_scaled_df.copy()

Checking Cophenetic Correlation¶

In [84]:
%%time 

# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]

# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(hc_df, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(hc_df))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672.
Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404.
Cophenetic correlation for Chebyshev distance and single linkage is 0.9062538164750717.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.598891419111242.
Cophenetic correlation for Chebyshev distance and average linkage is 0.9338265528030499.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9127355892367.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.9259195530524591.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.7925307202850004.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.9247324030159736.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8708317490180428.
Cophenetic correlation for Cityblock distance and single linkage is 0.9334186366528574.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7375328863205818.
Cophenetic correlation for Cityblock distance and average linkage is 0.9302145048594667.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.731045513520281.
CPU times: user 208 ms, sys: 80.8 ms, total: 289 ms
Wall time: 45.3 ms
In [85]:
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print("Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]))
Highest cophenetic correlation is 0.9422540609560814, which is obtained with Euclidean distance and average linkage.

Let's explore different linkage methods with Euclidean distance only.¶

In [86]:
%%time 

# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for lm in linkage_methods:
    Z = linkage(hc_df, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(hc_df))
    print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "euclidean"
        high_dm_lm[1] = lm
Cophenetic correlation for single linkage is 0.9232271494002922.
Cophenetic correlation for complete linkage is 0.7873280186580672.
Cophenetic correlation for average linkage is 0.9422540609560814.
Cophenetic correlation for centroid linkage is 0.9314012446828154.
Cophenetic correlation for ward linkage is 0.7101180299865353.
Cophenetic correlation for weighted linkage is 0.8693784298129404.
CPU times: user 109 ms, sys: 84.2 ms, total: 194 ms
Wall time: 22.4 ms
In [87]:
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print("Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(high_cophenet_corr, high_dm_lm[1]))
Highest cophenetic correlation is 0.9422540609560814, which is obtained with average linkage.

Therefore, we see that the cophenetic correlation is maximum with Euclidean distance and average linkage.¶

Checking Dendrograms¶

In [88]:
%%time 

# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]

# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []

# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(hc_df, metric="euclidean", method=method)

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(hc_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction")

    compare.append([method, coph_corr])    
CPU times: user 1.01 s, sys: 1.24 s, total: 2.25 s
Wall time: 872 ms
In [89]:
# let's create a dataframe to compare cophenetic correlations for each linkage method
df_cc = pd.DataFrame(compare, columns=compare_cols)
df_cc
Out[89]:
Linkage Cophenetic Coefficient
0 single 0.923
1 complete 0.787
2 average 0.942
3 centroid 0.931
4 ward 0.710
5 weighted 0.869

Observations¶

  • The cophenetic correlation is highest for average linkage method at 0.94.
  • So, we will move ahead with Average linkage first and Ward linkage next, as ward linkage has more distributed clusters.
  • 5 appears to be the appropriate number of clusters from the dendrogram for average linkage.

Creating model using sklearn using Eucledian distance and Average linkage¶

In [90]:
%%time 

HCmodel = AgglomerativeClustering(n_clusters=5, affinity="euclidean", linkage="average")
HCmodel.fit(hc_df)
CPU times: user 2.22 ms, sys: 864 µs, total: 3.09 ms
Wall time: 2.26 ms
Out[90]:
AgglomerativeClustering(linkage='average', n_clusters=5)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AgglomerativeClustering(linkage='average', n_clusters=5)
In [91]:
# creating a copy of the original data
df2 = data.copy()

# adding hierarchical cluster labels to the original and scaled dataframes
hc_df["HC_segments"] = HCmodel.labels_
df2["HC_segments"] = HCmodel.labels_

Cluster Profiling¶

In [92]:
hc_cluster_profile = df2.groupby("HC_segments").mean() 
In [93]:
hc_cluster_profile["count_in_each_segment"] = (
    df2.groupby("HC_segments")["Security"].count().values)
In [94]:
hc_cluster_profile.style.highlight_max(color="yellow", axis=0)
Out[94]:
  Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio count_in_each_segment
HC_segments                        
0 77.884243 4.105986 1.516865 35.320359 66.775449 -32825817.365269 1535255703.592814 2.903308 559027333.145509 32.437511 -1.781988 334
1 25.640000 11.237908 1.322355 12.500000 130.500000 16755500000.000000 13654000000.000000 3.295000 2791829362.100000 13.649696 1.508484 2
2 24.485001 -13.351992 3.482611 802.000000 51.000000 -1292500000.000000 -19106500000.000000 -41.815000 519573983.250000 60.748608 1.565141 2
3 104.660004 16.224320 1.320606 8.000000 958.000000 592000000.000000 3669000000.000000 1.310000 2800763359.000000 79.893133 5.884467 1
4 1274.949951 3.190527 1.268340 29.000000 184.000000 -1671386000.000000 2551360000.000000 50.090000 50935516.070000 25.453183 -1.052429 1
In [95]:
## Complete the code to print the companies in each cluster
for cl in df2["HC_segments"].unique():
    print("In cluster {}, the following companies are present:".format(cl))
    print(df2[df2["HC_segments"] == cl]["Security"].unique())
    print()
In cluster 0, the following companies are present:
['American Airlines Group' 'AbbVie' 'Abbott Laboratories'
 'Adobe Systems Inc' 'Analog Devices, Inc.' 'Archer-Daniels-Midland Co'
 'Alliance Data Systems' 'Ameren Corp' 'American Electric Power'
 'AFLAC Inc' 'American International Group, Inc.'
 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.'
 'Akamai Technologies Inc' 'Albemarle Corp' 'Alaska Air Group Inc'
 'Allstate Corp' 'Allegion' 'Alexion Pharmaceuticals'
 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc'
 'Amgen Inc' 'Ameriprise Financial' 'American Tower Corp A'
 'Amazon.com Inc' 'AutoNation Inc' 'Anthem Inc.' 'Aon plc'
 'Anadarko Petroleum Corp' 'Amphenol Corp' 'Arconic Inc'
 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom'
 'American Water Works Company Inc' 'American Express Co' 'Boeing Company'
 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.'
 'Baker Hughes Inc' 'BIOGEN IDEC Inc.' 'The Bank of New York Mellon Corp.'
 'Ball Corp' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner'
 'Boston Properties' 'Citigroup Inc.' 'Caterpillar Inc.' 'Chubb Limited'
 'CBRE Group' 'Crown Castle International Corp.' 'Carnival Corp.'
 'Celgene Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group'
 'Church & Dwight' 'C. H. Robinson Worldwide' 'Charter Communications'
 'CIGNA Corp.' 'Cincinnati Financial' 'Colgate-Palmolive' 'Comerica Inc.'
 'CME Group Inc.' 'Chipotle Mexican Grill' 'Cummins Inc.' 'CMS Energy'
 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial'
 'Cabot Oil & Gas' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc'
 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health'
 'Chevron Corp.' 'Concho Resources' 'Dominion Resources' 'Delta Air Lines'
 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services'
 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company'
 'Discovery Communications-A' 'Discovery Communications-C'
 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet'
 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.'
 'Devon Energy Corp.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison'
 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'EOG Resources'
 'Equinix' 'Equity Residential' 'EQT Corporation' 'Eversource Energy'
 'Essex Property Trust, Inc.' 'E*Trade' 'Eaton Corporation'
 'Entergy Corp.' 'Edwards Lifesciences' 'Exelon Corp.' "Expeditors Int'l"
 'Expedia Inc.' 'Extra Space Storage' 'Ford Motor' 'Fastenal Co'
 'Fortune Brands Home & Security' 'Freeport-McMoran Cp & Gld'
 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc'
 'FLIR Systems' 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation'
 'Federal Realty Investment Trust' 'First Solar Inc'
 'Frontier Communications' 'General Dynamics'
 'General Growth Properties Inc.' 'Gilead Sciences' 'Corning Inc.'
 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber'
 'Grainger (W.W.) Inc.' 'Halliburton Co.' 'Hasbro Inc.'
 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.'
 'Hess Corporation' 'Hartford Financial Svc.Gp.' 'Harley-Davidson'
 "Honeywell Int'l Inc." 'Hewlett Packard Enterprise' 'HP Inc.'
 'Hormel Foods Corp.' 'Henry Schein' 'Host Hotels & Resorts'
 'The Hershey Company' 'Humana Inc.' 'International Business Machines'
 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'International Paper'
 'Interpublic Group' 'Iron Mountain Incorporated'
 'Intuitive Surgical Inc.' 'Illinois Tool Works' 'Invesco Ltd.'
 'J. B. Hunt Transport Services' 'Jacobs Engineering Group'
 'Juniper Networks' 'JPMorgan Chase & Co.' 'Kimco Realty' 'Kimberly-Clark'
 'Kinder Morgan' 'Coca Cola Company' 'Kansas City Southern'
 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding'
 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.'
 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.'
 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell'
 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l."
 'Masco Corp.' 'Mattel Inc.' "McDonald's Corp." "Moody's Corp"
 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries'
 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials'
 'Marsh & McLennan' '3M Company' 'Monster Beverage' 'Altria Group Inc'
 'The Mosaic Company' 'Marathon Petroleum' 'Merck & Co.'
 'Marathon Oil Corp.' 'M&T Bank Corp.' 'Mettler Toledo' 'Murphy Oil'
 'Mylan N.V.' 'Navient' 'Noble Energy Inc' 'NASDAQ OMX Group'
 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Netflix Inc.'
 'Newfield Exploration Co' 'Nielsen Holdings'
 'National Oilwell Varco Inc.' 'Norfolk Southern Corp.'
 'Northern Trust Corp.' 'Nucor Corp.' 'Newell Brands'
 'Realty Income Corporation' 'ONEOK' 'Omnicom Group' "O'Reilly Automotive"
 'Occidental Petroleum' "People's United Financial" 'Pitney-Bowes'
 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.'
 'Pfizer Inc.' 'Principal Financial Group' 'Procter & Gamble'
 'Progressive Corp.' 'Pulte Homes Inc.' 'Philip Morris International'
 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital'
 'PPG Industries' 'PPL Corp.' 'Prudential Financial' 'Phillips 66'
 'Quanta Services Inc.' 'Praxair Inc.' 'PayPal' 'Ryder System'
 'Royal Caribbean Cruises Ltd' 'Regeneron' 'Robert Half International'
 'Roper Industries' 'Range Resources Corp.' 'Republic Services Inc'
 'SCANA Corp' 'Charles Schwab Corporation' 'Spectra Energy Corp.'
 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty'
 'Scripps Networks Interactive Inc.' 'Southern Co.'
 'Simon Property Group Inc' 'S&P Global, Inc.' 'Stericycle Inc'
 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.'
 'Skyworks Solutions' 'Southwestern Energy' 'Synchrony Financial'
 'Stryker Corp.' 'AT&T Inc' 'Molson Coors Brewing Company'
 'Teradata Corp.' 'Tegna, Inc.' 'Torchmark Corp.'
 'Thermo Fisher Scientific' 'TripAdvisor' 'The Travelers Companies Inc.'
 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.'
 'Total System Services' 'Texas Instruments' 'Under Armour'
 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.'
 'United Health Group Inc.' 'Unum Group' 'Union Pacific'
 'United Parcel Service' 'United Technologies' 'Varian Medical Systems'
 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust'
 'Verisk Analytics' 'Verisign Inc.' 'Vertex Pharmaceuticals Inc'
 'Ventas Inc' 'Verizon Communications' 'Waters Corporation'
 'Wec Energy Group Inc' 'Wells Fargo' 'Whirlpool Corp.'
 'Waste Management Inc.' 'Williams Cos.' 'Western Union Co'
 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Wynn Resorts Ltd'
 'Cimarex Energy' 'Xcel Energy Inc' 'XL Capital' 'Exxon Mobil Corp.'
 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yahoo Inc.'
 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis']

In cluster 2, the following companies are present:
['Apache Corporation' 'Chesapeake Energy']

In cluster 1, the following companies are present:
['Bank of America Corp' 'Intel Corp.']

In cluster 3, the following companies are present:
['Facebook']

In cluster 4, the following companies are present:
['Priceline.com Inc']

In [96]:
df2.groupby(["HC_segments", "GICS Sector"])['Security'].count()
Out[96]:
HC_segments  GICS Sector                
0            Consumer Discretionary         39
             Consumer Staples               19
             Energy                         28
             Financials                     48
             Health Care                    40
             Industrials                    53
             Information Technology         31
             Materials                      20
             Real Estate                    27
             Telecommunications Services     5
             Utilities                      24
1            Financials                      1
             Information Technology          1
2            Energy                          2
3            Information Technology          1
4            Consumer Discretionary          1
Name: Security, dtype: int64

Observations¶

  • Average Linkage method does not result in representative and distributed clusters, as Cluster 0 comprises 334 securities, Cluster 1 & 2 consist of 2 companies each, while Clusters 3 & 4 have only 1 security each
  • Ward Linkage method has more distributed clusters and hence, let's move forward with this method

Creating model using sklearn using Eucledian distance and Ward linkage¶

In [97]:
%%time 

HCmodel = AgglomerativeClustering(n_clusters=5, affinity="euclidean", linkage="ward")
HCmodel.fit(hc_df)
CPU times: user 2.6 ms, sys: 744 µs, total: 3.34 ms
Wall time: 2.61 ms
Out[97]:
AgglomerativeClustering(n_clusters=5)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AgglomerativeClustering(n_clusters=5)
In [98]:
# creating a copy of the original data
df3 = data.copy()

# adding hierarchical cluster labels to the original and scaled dataframes
hc_df["HC_segments"] = HCmodel.labels_
df3["HC_segments"] = HCmodel.labels_

Cluster Profiling¶

In [99]:
hc_cluster_profile = df3.groupby("HC_segments").mean() 
In [100]:
hc_cluster_profile["count_in_each_segment"] = (
    df3.groupby("HC_segments")["Security"].count().values)
In [101]:
hc_cluster_profile.style.highlight_max(color="yellow", axis=0)
Out[101]:
  Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio count_in_each_segment
HC_segments                        
0 213.518640 15.252913 1.779861 22.333333 258.740741 1504052814.814815 1716529851.851852 5.177407 689838338.441482 78.441603 13.022590 27
1 84.355716 3.854981 1.827670 633.571429 33.571429 -568400000.000000 -4968157142.857142 -10.841429 398169036.442857 42.284541 -11.589502 7
2 46.672222 5.166566 1.079367 25.000000 58.333333 -3040666666.666667 14848444444.444445 3.435556 4564959946.222222 15.596051 -6.354193 9
3 72.421687 4.563230 1.403434 25.218182 55.014545 72801872.727273 1572467469.090909 3.728564 445003946.148764 24.188244 -2.966949 275
4 36.440455 -16.073408 2.832884 57.500000 42.409091 -472834090.909091 -3161045227.272727 -8.005000 514367806.201818 85.555682 0.836839 22
In [102]:
## Complete the code to print the companies in each cluster
for cl in df3["HC_segments"].unique():
    print("In cluster {}, the following companies are present:".format(cl))
    print(df3[df3["HC_segments"] == cl]["Security"].unique())
    print()
In cluster 3, the following companies are present:
['American Airlines Group' 'AbbVie' 'Abbott Laboratories'
 'Analog Devices, Inc.' 'Archer-Daniels-Midland Co' 'Ameren Corp'
 'American Electric Power' 'AFLAC Inc'
 'American International Group, Inc.' 'Apartment Investment & Mgmt'
 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc'
 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'AMETEK Inc'
 'Affiliated Managers Group Inc' 'Ameriprise Financial'
 'American Tower Corp A' 'AutoNation Inc' 'Anthem Inc.' 'Aon plc'
 'Amphenol Corp' 'Arconic Inc' 'Activision Blizzard'
 'AvalonBay Communities, Inc.' 'American Water Works Company Inc'
 'American Express Co' 'Boeing Company' 'Baxter International Inc.'
 'BB&T Corporation' 'Bard (C.R.) Inc.' 'BIOGEN IDEC Inc.'
 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb'
 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Caterpillar Inc.'
 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.'
 'Carnival Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group'
 'Church & Dwight' 'C. H. Robinson Worldwide' 'CIGNA Corp.'
 'Cincinnati Financial' 'Comerica Inc.' 'CME Group Inc.' 'Cummins Inc.'
 'CMS Energy' 'Centene Corporation' 'CenterPoint Energy'
 'Capital One Financial' 'The Cooper Companies' 'CSX Corp.'
 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems'
 'CVS Health' 'Chevron Corp.' 'Dominion Resources' 'Delta Air Lines'
 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services'
 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company'
 'Discovery Communications-A' 'Discovery Communications-C'
 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet'
 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.'
 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.'
 "Edison Int'l" 'Eastman Chemical' 'Equity Residential' 'EQT Corporation'
 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade'
 'Eaton Corporation' 'Entergy Corp.' 'Exelon Corp.' "Expeditors Int'l"
 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co'
 'Fortune Brands Home & Security' 'FirstEnergy Corp'
 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems'
 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation'
 'Federal Realty Investment Trust' 'General Dynamics'
 'General Growth Properties Inc.' 'Gilead Sciences' 'Corning Inc.'
 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber'
 'Grainger (W.W.) Inc.' 'Halliburton Co.' 'Hasbro Inc.'
 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.'
 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc."
 'Hewlett Packard Enterprise' 'HP Inc.' 'Hormel Foods Corp.'
 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company'
 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories'
 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group'
 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.'
 'J. B. Hunt Transport Services' 'Jacobs Engineering Group'
 'Juniper Networks' 'Kimco Realty' 'Kansas City Southern'
 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding'
 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.'
 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.'
 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell'
 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l."
 'Masco Corp.' 'Mattel Inc.' "McDonald's Corp." "Moody's Corp"
 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries'
 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials'
 'Marsh & McLennan' '3M Company' 'Altria Group Inc' 'The Mosaic Company'
 'Marathon Petroleum' 'Merck & Co.' 'M&T Bank Corp.' 'Mettler Toledo'
 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group' 'NextEra Energy'
 'Nielsen Holdings' 'Norfolk Southern Corp.' 'Northern Trust Corp.'
 'Nucor Corp.' 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group'
 "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes'
 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.'
 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.'
 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services'
 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.'
 'Prudential Financial' 'Phillips 66' 'Quanta Services Inc.'
 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd'
 'Robert Half International' 'Roper Industries' 'Republic Services Inc'
 'SCANA Corp' 'Charles Schwab Corporation' 'Sealed Air' 'Sherwin-Williams'
 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.'
 'Simon Property Group Inc' 'Stericycle Inc' 'Sempra Energy'
 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions'
 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company'
 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific'
 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods'
 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments'
 'Under Armour' 'United Continental Holdings' 'UDR Inc'
 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group'
 'Union Pacific' 'United Parcel Service' 'United Technologies'
 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials'
 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Ventas Inc'
 'Wec Energy Group Inc' 'Whirlpool Corp.' 'Waste Management Inc.'
 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide'
 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona' 'Xerox Corp.'
 'Xylem Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp'
 'Zoetis']

In cluster 0, the following companies are present:
['Adobe Systems Inc' 'Alliance Data Systems' 'Alexion Pharmaceuticals'
 'Applied Materials Inc' 'Amgen Inc' 'Amazon.com Inc' 'Broadcom'
 'Bank of America Corp' 'Celgene Corp.' 'Chipotle Mexican Grill' 'Equinix'
 'Edwards Lifesciences' 'Facebook' 'First Solar Inc'
 'Frontier Communications' 'Intel Corp.' 'Intuitive Surgical Inc.'
 'Monster Beverage' 'Newmont Mining Corp. (Hldg. Co.)' 'Netflix Inc.'
 'Priceline.com Inc' 'Regeneron' 'TripAdvisor'
 'Vertex Pharmaceuticals Inc' 'Waters Corporation' 'Wynn Resorts Ltd'
 'Yahoo Inc.']

In cluster 1, the following companies are present:
['Allegion' 'Apache Corporation' 'Chesapeake Energy'
 'Charter Communications' 'Colgate-Palmolive' 'Kimberly-Clark'
 'S&P Global, Inc.']

In cluster 4, the following companies are present:
['Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Cabot Oil & Gas'
 'Concho Resources' 'Devon Energy Corp.' 'EOG Resources'
 'Freeport-McMoran Cp & Gld' 'Hess Corporation' 'Kinder Morgan'
 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc'
 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK'
 'Occidental Petroleum' 'Range Resources Corp.' 'Spectra Energy Corp.'
 'Southwestern Energy' 'Teradata Corp.' 'Williams Cos.' 'Cimarex Energy']

In cluster 2, the following companies are present:
['Citigroup Inc.' 'Ford Motor' 'JPMorgan Chase & Co.' 'Coca Cola Company'
 'Pfizer Inc.' 'AT&T Inc' 'Verizon Communications' 'Wells Fargo'
 'Exxon Mobil Corp.']

In [103]:
df3.groupby(["HC_segments", "GICS Sector"])['Security'].count()
Out[103]:
HC_segments  GICS Sector                
0            Consumer Discretionary          5
             Consumer Staples                1
             Financials                      1
             Health Care                     8
             Information Technology          9
             Materials                       1
             Real Estate                     1
             Telecommunications Services     1
1            Consumer Discretionary          1
             Consumer Staples                2
             Energy                          2
             Financials                      1
             Industrials                     1
2            Consumer Discretionary          1
             Consumer Staples                1
             Energy                          1
             Financials                      3
             Health Care                     1
             Telecommunications Services     2
3            Consumer Discretionary         33
             Consumer Staples               15
             Energy                          7
             Financials                     44
             Health Care                    31
             Industrials                    52
             Information Technology         23
             Materials                      18
             Real Estate                    26
             Telecommunications Services     2
             Utilities                      24
4            Energy                         20
             Information Technology          1
             Materials                       1
Name: Security, dtype: int64
In [104]:
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=hc_df, x="HC_segments", y=variable)

plt.tight_layout(pad=2.0)
In [105]:
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of original numerical variables for each cluster")

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=df3, x="HC_segments", y=variable)

plt.tight_layout(pad=2.0)

Insights¶

  • It can be seen that Cluster 0 has 27 companies/securities, Cluster 1 has 7 companies, Cluster 2 has 9 companies, Cluster 3 has 275 companies, while Cluster 4 has 22 companies

Cluster 0¶

  • Out of the 27 companies, majority of the companies belonged to the Information Technology, Health Care, Consumer Discretionary sectors
  • The stocks in this cluster have the highest average current price (~ 214), with high positive price changes (average ~ 15%) and low average volatility (~ 1.8)
  • Stocks had a low ROE (average ~ 22), but the highest cash ratio (average ~ 259) among all clusters, indicating the ability of companies to cover their short-term obligations using only cash and cash equivalents
  • Stocks in this cluster have a positive average net cash flow, suggesting cash inflows > cash outflows. Also, the average net income as well as the EPS are positive
  • The average P/E ratio is high (average ~ 78), which indicates the willingness of investors to invest in stocks in this cluster
  • This cluster has the highest average P/B ratio, suggesting that the stock price for stocks in this cluster is trading at a premium to the company's book value

Cluster 1¶

  • Out of the 7 companies, majority of the companies belonged to the Energy and Comsumer staples sectors
  • With respect to Current Price, the majority of stocks in this cluster were priced under 200 dollars, with a mean price of ~ 84 dollars
  • Majority of stocks exhibited positive price changes, with low volatility (average volatility is ~ 1.8)
  • Stocks in this cluster had the highest average ROE (average ~ 634) and low cash ratio (average ~ 33)
  • Stocks in this cluster have a negative average net cash flow, suggesting cash outflows > cash inflows. Further, the average net income as well as the EPS are also negative
  • Average P/E ratio is moderate (~ 42) as compared to other clusters, while the average P/B ratio is negative, suggesting undervalued stocks

Cluster 2¶

  • Out of the 9 securities, the majority came from Financials and Telecommunications Services sectors
  • With respect to the Current Price, the stocks in this cluster were priced low (average ~ 47 dollars), with positive average price changes (~ 5%) and low volatility (~ 1)
  • Stocks had a low ROE (average ~ 25) and an average cash ratio of ~ 58
  • Stocks in this cluster have a negative average net cash flow, suggesting cash outflows > cash inflows
  • However, this cluster has the highest positive average net income, suggesting revenues other than cash
  • Stocks in this cluster have positive but low average EPS (~ 3)
  • Average P/E ratio is positive but the lowest of all clusters (~16), while the average P/B ratio is negative, suggesting undervalued stocks

Cluster 3¶

  • This is the biggest and most diversified cluster
  • Out of the 275 securities, the majority belong to the Industrials, Financials, Consumer Discretionary and Health Care sectors
  • With respect to Current Price, the majority of stocks in this cluster were priced under 100 dollars, with a mean price of ~ 72 dollars
  • The stocks exhibited positive price changes, with low volatility (average volatility is ~ 1.4)
  • Stocks had a low ROE (average ~ 25) and an average cash ratio of ~ 55, indicating low liquidity
  • The average net cash flow, net income and EPS are positive
  • Average P/E ratio is low but positive (~ 24), while the average P/B ratio is negative (~ -2.9), suggesting undervalued stocks

Cluster 4¶

  • Out of the 22 securities, 20 stocks belong to the Energy sector
  • The stocks in this cluster have the lowest average current price (~ 36), with negative price changes (average ~ -16%) and highest average volatility (~ 2.8) among all clusters
  • This was also seen in EDA
  • Stocks in this cluster had the second best average ROE after cluster 1 (~ 57) and low cash ratio (average ~ 42)
  • The average net cash flow, net income and earnings per share are negative
  • As noted in EDA too, the average P/E ratio is the highest for this cluster (energy sector) (~ 85), and the average P/B is low but positive (~ 1), indicative of no undervaluation in stock price, hence being attractive investment opportunities for investors

Dimensionality Reduction using PCA¶

Let's use PCA to reduce the data to two dimensions and visualize it to see how well-separated the clusters are

In [106]:
# importing library
from sklearn.decomposition import PCA

# setting the number of components to 2
pca = PCA(n_components=2)

# transforming data and storing results in a dataframe
X_reduced_pca = pca.fit_transform(subset_scaled_df)
reduced_df_pca = pd.DataFrame(
    data=X_reduced_pca, columns=["Component 1", "Component 2"])
In [107]:
# checking the amount of variance explained
pca.explained_variance_ratio_.sum()
Out[107]:
0.3712899567725857

Observation¶

  • The first two principal components explain 37.1% of the variance in the data
In [108]:
sns.scatterplot(data=reduced_df_pca, x="Component 1", y="Component 2")
Out[108]:
<AxesSubplot:xlabel='Component 1', ylabel='Component 2'>
In [109]:
sns.scatterplot(
    data=reduced_df_pca,
    x="Component 1",
    y="Component 2",
    hue=df3["HC_segments"],
    palette="rainbow",
)
plt.legend(bbox_to_anchor=(1, 1))
Out[109]:
<matplotlib.legend.Legend at 0x2a79597f0>

Observations¶

  • We can see that Cluster 3 contains majority of data points
  • Other clusters (Clusters 0, 1, 2 and 4) are also recognizable but they are capturing outliers as well

K-means vs Hierarchical Clustering¶

There are more similarities than dissimilarities between clusters obtained from K-Means clustering & Hierarchical clustering¶

Which clustering technique took less time for execution?¶

  • Both the techniques, K-Means clustering & Hierarchical clustering, were quick and were able to fit the model within ~1 second

Which clustering technique gave you more distinct clusters, or are they the same?¶

  • Both the techniques, K-Means clustering & Hierarchical clustering, gave similar clusters with no recognizable differences between them

How many observations are there in the similar clusters of both algorithms?¶

  • Both algorithms had the following clusters:
    • For k-means, Cluster 0 has 30 companies/securities, Cluster 1 has 275 companies, Cluster 2 has 21 companies, Cluster 3 has 11 companies, while Cluster 4 has 3 companies
    • For hierarchical clustering, Cluster 0 has 27 stocks, Cluster 1 has 7 stocks, Cluster 2 has 9 stocks, Cluster 3 has 275 stocks and Cluster 4 has 22 stocks
  • Hence, the similar clusters for K-Means clustering & Hierarchical clustering comprised of 275 securities/companies

How many clusters are obtained as the appropriate number of clusters from both algorithms?¶

  • For both techniques, K-Means clustering & Hierarchical clustering, 5 clusters were taken as appropriate to ensure sufficient representation of all securities

Actionable Insights and Recommendations¶

Based on the Exploratory Data Analysisconducted and the Clustering techniques used, the following insights and recommendations can be made:

  • Both clustering techniques yielded 5 clusters, where majority of securities were essentially clustered in a similar manner.

  • One cluster (with 3 securities as per k-means and 22 securities as per hierarchical clustering) majorly comprised of companies from Energy Sector.

    • This cluster was peculiar due to its riskiness with the lowest average current price, with negative price changes and highest average volatility amongst all clusters.
    • Nonetheless, it had the highest P/E Ratio, which suggests the investors were willing to invest more in a single share of a company/security in this sector per dollar of their earnings as opposed to securities in other industry sectors.
    • Therefore, risk-seeking investors can certainly look towards investing in securities from this cluster.
  • Second cluster (with 275 securities each as per k-means and hierarchical clustering) was the most diversified portfolio of stocks, which allows investors to spread risk

    • This consisted majorly of Industrials, Financials, Consumer Discretionary and Health Care sectors.
    • The average current stock price was moderate (below ~ 200 dollars) while the price change was positive with low volatility, making this an affordable and risk-neutral investment option.
    • However, investors must be mindful of undervalued stocks in their portfolio.
    • Therefore, investors seeking to hedge risk can go forward with investing in securities from this cluster.
  • The third cluster (with 21 securities as per k-means and 27 securities as per hierarchical clustering) majorly comprised of the Health Care, Consumer Discretionary and Information Technology sectors.

    • Although the securities of this cluster have the highest current price, but that is accompanied by the highest positive price changes and low price volatility
    • They also have the highest cash ratio, indicating greater liquidity and the ability of companies to cover their short-term obligations using only cash and cash equivalents
    • They also have the highest P/B ratio, suggesting that the stock price for stocks in this cluster is trading at a premium to the company's book value
    • Therefore, investors who seek moderate risk and have the funds to invest, should go ahead with investing in securities from this cluster.
  • The next cluster (with 11 securities as per k-means and 9 securities as per hierarchical clustering) majorly consisted of stocks from Financials, Telecommunications Services and. Health Care sectors.

    • The stocks in this cluster are moderately priced, with positive price changes and low volatility
    • They also have a moderate to high cash ratio, indicating sufficient liquidity and the highest net income, suggesting considerable revenue beyond cash
    • However, investors will need to be mindful of undervalued stocks in their portfolio.
    • So, investors looking for an afforable investment option with high revenue and moderate risk can opt for securities from this cluster.

Clearly, CHANGE is the only constant in the stock market.

Also, while comparing the metric for a specific industry, an investor must be well aware of its ideal ratio values for differnt variables, e.g., Earnings Per Share, P/E ratio, P/B ratio, etc, as these variables are industry specific.

Lastly, while k-means and hierarchical clustering algorithms have yielded representative clusters, stock movements change daily and, therefore, Dynamic clustering would prove useful for further analysis of such data